CN113793590A - Speech synthesis method and device - Google Patents

Speech synthesis method and device

Info

Publication number
CN113793590A
Authority
CN
China
Prior art keywords
phoneme
duration
identity
voice data
identity code
Prior art date
Legal status
Granted
Application number
CN202010457474.4A
Other languages
Chinese (zh)
Other versions
CN113793590B (en)
Inventor
别凡虎
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010457474.4A
Publication of CN113793590A
Application granted
Publication of CN113793590B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)

Abstract

The application is applicable to the technical field of terminal artificial intelligence, and provides a speech synthesis method and apparatus. The method includes: acquiring an identity code, where the identity code is used to indicate the identity of a third party performing voice conversion; determining the phoneme duration of each phoneme corresponding to the text to be converted, where the phoneme duration of at least one phoneme is determined according to the identity code; and converting the text to be converted according to the phoneme durations of the phonemes to obtain voice data. The voice data thus carries a digital signature indicating the identity of the third party, namely the phoneme durations determined by the identity code, and the identity code, and hence the identity of the third-party company, can be recovered from those phoneme durations. This solves the problem that the identity of the third party synthesizing the voice data cannot be determined, allows that identity to be determined from the voice data even after secondary transcription, and thereby improves the stability of the digital signature.

Description

Speech synthesis method and device
Technical Field
The application belongs to the technical field of terminal artificial intelligence, and particularly relates to a voice synthesis method and device.
Background
With the continuous development of terminal devices, the terminal devices can synthesize voice data and add digital signatures to the synthesized voice data, so that the identity of a third party synthesizing the voice data can be determined.
In the related art, a third party may synthesize voice data using a voice synthesis technique, and add a digital signature representing the identity of the third party to the voice data during the synthesis of the voice data, so that the identity of the third party synthesizing the voice data can be determined based on the digital signature.
However, when the source file of the voice data cannot be obtained (for example, after secondary transcription or re-recording), the digital signature in the voice data cannot be recovered, and the identity of the third party that synthesized the voice data cannot be determined.
Disclosure of Invention
The embodiment of the application provides a voice synthesis method and a voice synthesis device, which can solve the problem that the identity of a third party synthesizing voice data cannot be determined.
In a first aspect, an embodiment of the present application provides a speech synthesis method, including:
acquiring an identity code, wherein the identity code is used for indicating the identity of a third party performing voice conversion;
determining the phoneme duration of each phoneme corresponding to the text to be converted, wherein the phoneme duration of at least one phoneme is determined according to the identity code;
and converting the text to be converted according to the phoneme duration of each phoneme to obtain voice data.
In a first possible implementation manner of the first aspect, the obtaining the identity code includes:
acquiring an identity signature of the third party based on preset configuration information;
and searching the identity code corresponding to the identity signature from the preset corresponding relation between the identity signature and the identity code.
In a second possible implementation manner of the first aspect, the determining a phoneme duration of each phoneme corresponding to the text to be converted includes:
inputting the text to be converted into a preset duration model to obtain the initial duration of each phoneme;
selecting at least one target phoneme from the phonemes corresponding to the text to be converted;
determining a duration increment of each target phoneme according to the identity code;
determining the phoneme duration of each target phoneme according to the initial duration and the duration increment of each target phoneme;
and for each phoneme except the target phoneme in each phoneme corresponding to the text to be converted, determining the initial duration of the phoneme as the phoneme duration of the phoneme.
In a third possible implementation manner of the first aspect, based on the second possible implementation manner of the first aspect, the duration increment of each of the target phonemes is the same.
Based on the second possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, the determining, according to the identity code, a duration increment of each target phoneme includes:
and determining the duration increment of each target phoneme according to the identity code and a preset increment factor.
In a fifth possible implementation manner of the first aspect, the method further includes:
acquiring standard voice data and abnormal voice data;
respectively extracting each phoneme in the standard voice data and the abnormal voice data;
for each phoneme in the abnormal voice data, comparing the phoneme with the matching phoneme in the standard voice data to obtain a duration difference value corresponding to the phoneme;
and determining an identity signature according to the duration difference values.
Based on the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner of the first aspect, the determining an identity signature according to each of the duration difference values includes:
determining an identity code corresponding to the abnormal voice data according to each duration difference value and a preset increment factor;
and determining the identity signature according to the identity code corresponding to the abnormal voice data and the preset correspondence between identity signatures and identity codes.
Based on the fifth possible implementation manner or the sixth possible implementation manner of the first aspect, in a seventh possible implementation manner of the first aspect, the separately extracting each phoneme in the standard speech data and the abnormal speech data includes:
and under the condition that the abnormal voice data is complete voice data, extracting each phoneme in the standard voice data and the abnormal voice data respectively.
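For illustration, a minimal Python sketch of the identity-recovery flow described in the fifth to seventh implementation manners is given below. The increment factor value, the signature table, and all function and variable names are assumptions made for the sketch, not details disclosed by the application.

```python
# Minimal sketch of identity recovery from abnormal voice data.
# Assumptions: phoneme durations are already extracted and aligned
# one-to-one, the target-phoneme positions follow a preset selection
# rule known to the detector, and each code digit was embedded as
# digit * INCREMENT_FACTOR seconds of extra duration.

INCREMENT_FACTOR = 0.01  # assumed increment factor, in seconds

# Assumed preset correspondence between identity signatures and codes.
SIGNATURE_TO_CODE = {"third-party-A": (0, 1, 2, 1), "third-party-B": (2, 3, 0, 1)}
CODE_TO_SIGNATURE = {code: sig for sig, code in SIGNATURE_TO_CODE.items()}

def recover_identity(standard_durations, abnormal_durations, target_positions):
    """Compare matched phoneme durations (seconds) and recover the signature."""
    code = tuple(
        round((abnormal_durations[i] - standard_durations[i]) / INCREMENT_FACTOR)
        for i in target_positions
    )
    return CODE_TO_SIGNATURE.get(code)  # None if the code is not registered

# Example: code (0, 1, 2, 1) embedded at phoneme positions 1, 3, 5, 7.
std = [0.12, 0.20, 0.15, 0.18, 0.22, 0.16, 0.19, 0.21]
abn = [0.12, 0.20, 0.15, 0.19, 0.22, 0.18, 0.19, 0.22]
print(recover_identity(std, abn, (1, 3, 5, 7)))  # "third-party-A"
```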
In a second aspect, an embodiment of the present application provides a speech synthesis apparatus, including:
a first acquisition module, configured to acquire an identity code, where the identity code is used to indicate the identity of a third party performing voice conversion;
a first determining module, configured to determine a phoneme duration of each phoneme corresponding to a text to be converted, where the phoneme duration of at least one phoneme is determined according to the identity code;
and the conversion module is used for converting the text to be converted according to the phoneme duration of each phoneme to obtain voice data.
In a first possible implementation manner of the second aspect, the first obtaining module is specifically configured to obtain an identity signature of the third party based on preset configuration information; and searching the identity code corresponding to the identity signature from the preset corresponding relation between the identity signature and the identity code.
In a second possible implementation manner of the second aspect, the first determining module is specifically configured to input the text to be converted into a preset duration model, so as to obtain an initial duration of each phoneme; select at least one target phoneme from the phonemes corresponding to the text to be converted; determine a duration increment of each target phoneme according to the identity code; determine the phoneme duration of each target phoneme according to the initial duration and the duration increment of each target phoneme; and for each phoneme except the target phonemes in the phonemes corresponding to the text to be converted, determine the initial duration of the phoneme as the phoneme duration of the phoneme.
In a third possible implementation manner of the second aspect, based on the second possible implementation manner of the second aspect, the duration increment of each of the target phonemes is the same.
Based on the second possible implementation manner of the second aspect, in a fourth possible implementation manner of the second aspect, the first determining module is further specifically configured to determine a duration increment of each target phoneme according to the identity code and a preset increment factor.
In a fifth possible implementation manner of the second aspect, the apparatus further includes:
the second acquisition module is used for acquiring standard voice data and abnormal voice data;
the extraction module is used for respectively extracting each phoneme in the standard voice data and the abnormal voice data;
the comparison module is used for comparing each phoneme in the abnormal voice data with the matching phoneme in the standard voice data to obtain a duration difference value corresponding to the phoneme;
and the second determining module is used for determining the identity signature according to each duration difference value.
Based on a fifth possible implementation manner of the second aspect, in a sixth possible implementation manner of the second aspect, the second determining module is specifically configured to determine, according to each of the duration difference values and a preset increment factor, an identity code corresponding to the abnormal voice data; and determining the identity signature according to the identity code corresponding to the abnormal voice data and the preset corresponding relation between the identity signature and the identity code.
Based on the fifth possible implementation manner or the sixth possible implementation manner of the second aspect, in a seventh possible implementation manner of the second aspect, the extracting module is specifically configured to extract each phoneme in the standard speech data and each phoneme in the abnormal speech data respectively when the abnormal speech data is complete speech data.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the speech synthesis method according to any one of the above first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, where the computer program is executed by a processor to implement the speech synthesis method according to any one of the above first aspects.
In a fifth aspect, the present application provides a computer program product, which when run on a terminal device, causes the terminal device to execute the speech synthesis method according to any one of the above first aspects.
In a sixth aspect, an embodiment of the present application provides a chip system, where the chip system includes a processor, the processor is coupled with a memory, and the processor executes a computer program stored in the memory to implement the speech synthesis method according to any one of the first aspect.
The chip system can be a single chip or a chip module consisting of a plurality of chips.
Compared with the prior art, the embodiment of the application has the advantages that:
according to the embodiment of the application, the identity code for indicating the identity of the third party performing voice conversion is obtained, and the phoneme duration of each phoneme corresponding to the text to be converted is determined according to the identity code, wherein the phoneme duration of at least one phoneme is determined according to the identity code, and then the text to be converted is converted according to the phoneme duration of each phoneme to obtain the voice data, so that the voice data can comprise the digital signature indicating the identity of the third party, namely the phoneme duration determined by the identity code, and the identity code can be determined through the phoneme duration, so that the identity of a third party company is determined.
Drawings
Fig. 1 is a schematic diagram of a speech synthesis scenario related to a speech synthesis method provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of a speech synthesis scenario related to another speech synthesis method provided by an embodiment of the present application;
Fig. 3 is a block diagram of a terminal device provided by an embodiment of the present application;
Fig. 4 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application;
Fig. 5 is a schematic flowchart of an identity recognition method provided by an embodiment of the present application;
Fig. 6 is a block diagram of a speech synthesis apparatus provided by an embodiment of the present application;
Fig. 7 is a block diagram of another speech synthesis apparatus provided by an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
The terminology used in the following embodiments is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, such as "one or more", unless the context clearly indicates otherwise. It should also be understood that in the embodiments of the present application, "one or more" means one, two, or more than two; "and/or" describes the association relationship of the associated objects, indicating that three relationships may exist; for example, A and/or B may represent: A alone, both A and B, and B alone, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The speech synthesis method provided by the embodiment of the application can be applied to terminal devices such as a mobile phone, a tablet personal computer, a wearable device, a vehicle-mounted device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), and the like, and the embodiment of the application does not limit the specific type of the terminal device at all.
For example, the terminal device may be a station (ST) in a WLAN, a cellular phone, a cordless phone, a session initiation protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA) device, a handheld device with wireless communication capability, a computing device or other processing device connected to a wireless modem, a vehicle-mounted device, an Internet-of-Vehicles terminal, a computer, a laptop computer, a handheld communication device, a handheld computing device, a satellite wireless device, or the like.
By way of example and not limitation, when the terminal device is a wearable device, the wearable device may be a portable device that applies wearable technology to everyday accessories, such as glasses, gloves, watches, clothing, and shoes, and that is worn directly on the body or integrated into the user's clothing or accessories. A wearable device is not only a hardware device, but also realizes powerful functions through software support, data interaction, and cloud interaction. Broadly, wearable intelligent devices include full-featured, large-sized devices that can realize complete or partial functions without relying on a smartphone, such as smart watches or smart glasses, as well as devices that focus on a single application function and need to be used together with other devices such as a smartphone, for example various smart bracelets and smart jewelry for monitoring physical signs.
Fig. 1 is a scene schematic diagram of a speech synthesis scene related to a speech synthesis method provided in an embodiment of the present application, and referring to fig. 1, the speech synthesis scene may include a terminal device 110, and the terminal device 110 may obtain a text to be converted, and generate speech data corresponding to a third party by combining with an identity code of the third party.
In a possible implementation manner, the terminal device 110 may obtain a text to be converted, determine an initial duration of each phoneme in the text to be converted through a pre-trained duration model, and obtain an identity signature for identifying a third party according to pre-stored configuration information, so that an identity code for generating voice data may be determined according to the identity signature.
Then, the terminal device 110 may calculate a duration increment according to the identity code, select at least one phoneme from a plurality of phonemes of the text to be converted as the target phoneme according to a preset selection manner, calculate a phoneme duration of the target phoneme according to the duration increment and the initial duration of the target phoneme, and convert the text to be converted according to the phoneme duration of the target phoneme and the phoneme durations of other phonemes to obtain the speech data corresponding to the third party.
The third party may be a provider of the terminal device 110, a developer of the application program in the terminal device 110, or another user who applies the speech synthesis method provided in this embodiment, which is not limited in this embodiment of the present application.
For example, if the terminal device 110 is an in-vehicle device, the in-vehicle device may generate a text to be converted based on a position of a vehicle during a navigation process, and generate voice data for indicating a driving path in combination with a third-party identity code corresponding to the in-vehicle device or a navigation application program.
In addition, a phoneme is the smallest unit of speech, divided according to the natural attributes of speech and analyzed according to the pronunciation actions within a syllable; one pronunciation action constitutes one phoneme. For example, under the pronunciation rules of pinyin, the initial of each character's pinyin may be taken as one phoneme and the final as another phoneme: in the word 天气 ("weather"), the phonemes corresponding to the character 天 ("tian") may include "t" and "ian", and the phonemes corresponding to the character 气 ("qi") may include "q" and "i".
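As a toy illustration of this initial/final split (the initials table and the zero-initial handling below are simplified assumptions, not part of the application):

```python
# Minimal sketch: split a pinyin syllable into initial/final phonemes.
# Two-letter initials are checked before single letters ("zh" before "z").
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def pinyin_to_phonemes(syllable: str) -> list[str]:
    for initial in INITIALS:
        if syllable.startswith(initial):
            return [initial, syllable[len(initial):]]
    return [syllable]  # zero-initial syllable such as "er"

print(pinyin_to_phonemes("tian"))  # ['t', 'ian']
print(pinyin_to_phonemes("qi"))    # ['q', 'i']
```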
Of course, the speech synthesis method provided by the embodiment of the present application may also be applied to other terminal artificial intelligence fields, such as smart homes, wearable devices, and the like, and the embodiment of the present application does not limit this.
It should be noted that, in practical application, referring to fig. 2, a speech synthesis scenario may further include a server 120, and the terminal device 110 may be connected to the server 120, so that the server 120 may convert a text to be converted to obtain speech data.
Correspondingly, in the process of generating the voice data, the terminal device 110 may first send a text to be converted and configuration information to the server 120, and the server 120 may determine the initial durations of the phonemes according to the text to be converted, and then determine an identity code corresponding to the identity signature according to the identity signature in the configuration information and the correspondence between the stored identity signature and the identity code; calculating according to the identity codes to obtain duration increments, and calculating according to the duration increments to obtain phoneme durations of the target phonemes in the phonemes; and finally, converting the text to be converted according to the phoneme duration of each target phoneme and the initial durations of other phonemes to generate voice data.
For simplicity, the following embodiments are described only by taking an example in which the speech synthesis scenario includes the terminal device 110 and does not include the server 120; in practical applications, either the terminal device 110 or the server 120 may convert the text to be converted into the voice data, which is not limited in this application.
Fig. 3 is a block diagram of a terminal device according to an embodiment of the present disclosure. Referring to fig. 3, the terminal device may include a processor 310, an external memory interface 320, an internal memory 321, a Universal Serial Bus (USB) interface 330, a charging management module 340, a power management module 341, a battery 342, an antenna 1, an antenna 2, a mobile communication module 350, a wireless communication module 360, an audio module 370, a speaker 370A, a receiver 370B, a microphone 370C, an earphone interface 370D, a sensor module 380, a button 390, a motor 391, an indicator 392, a camera 393, a display 394, and a Subscriber Identification Module (SIM) card interface 395, and the like. The sensor module 380 may include a pressure sensor 380A, a gyroscope sensor 380B, an air pressure sensor 380C, a magnetic sensor 380D, an acceleration sensor 380E, a distance sensor 380F, a proximity light sensor 380G, a fingerprint sensor 380H, a temperature sensor 380J, a touch sensor 380K, an ambient light sensor 380L, a bone conduction sensor 380M, and the like.
It is to be understood that the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the terminal device. In other embodiments of the present application, a terminal device may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components may be used. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 310 may include one or more processing units, such as: the processor 310 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller can be a neural center and a command center of the terminal equipment. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 310 for storing instructions and data. In some embodiments, the memory in the processor 310 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 310. If the processor 310 needs to reuse the instruction or data, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 310, thereby increasing the efficiency of the system.
In some embodiments, processor 310 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.
The I2C interface is a bi-directional synchronous serial bus that includes a serial data line (SDA) and a Serial Clock Line (SCL). In some embodiments, the processor 310 may include multiple sets of I2C buses. The processor 310 may be coupled to the touch sensor 380K, the charger, the flash, the camera 393, etc., via different I2C bus interfaces. For example: the processor 310 may be coupled to the touch sensor 380K through an I2C interface, so that the processor 310 and the touch sensor 380K communicate through an I2C bus interface, thereby implementing the touch function of the terminal device.
The I2S interface may be used for audio communication. In some embodiments, the processor 310 may include multiple sets of I2S buses. The processor 310 may be coupled to the audio module 370 via an I2S bus to enable communication between the processor 310 and the audio module 370. In some embodiments, the audio module 370 may communicate audio signals to the wireless communication module 360 via an I2S interface, enabling answering of calls via a bluetooth headset.
The PCM interface may also be used for audio communication, sampling, quantizing and encoding analog signals. In some embodiments, the audio module 370 and the wireless communication module 360 may be coupled by a PCM bus interface. In some embodiments, the audio module 370 may also transmit audio signals to the wireless communication module 360 through the PCM interface, so as to implement a function of answering a call through a bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
The UART interface is a universal serial data bus used for asynchronous communications. The bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is generally used to connect the processor 310 with the wireless communication module 360. For example: the processor 310 communicates with the bluetooth module in the wireless communication module 360 through the UART interface to implement the bluetooth function. In some embodiments, the audio module 370 may transmit the audio signal to the wireless communication module 360 through a UART interface, so as to realize the function of playing music through a bluetooth headset.
The MIPI interface may be used to connect processor 310 with peripheral devices such as display 394, camera 393, and the like. The MIPI interface includes a Camera Serial Interface (CSI), a Display Serial Interface (DSI), and the like. In some embodiments, processor 310 and camera 393 communicate over a CSI interface to implement the capture functionality of the terminal device. The processor 310 and the display screen 394 communicate through a DSI interface to realize the display function of the terminal device.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal and may also be configured as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 310 with the camera 393, the display 394, the wireless communication module 360, the audio module 370, the sensor module 380, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, and the like.
The USB interface 330 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 330 may be used to connect a charger to charge the terminal device, or may be used to transmit data between the terminal device and the peripheral device. And the earphone can also be used for connecting an earphone and playing audio through the earphone. The interface may also be used to connect other electronic devices, such as AR devices and the like.
It should be understood that the interface connection relationship between the modules in the embodiment of the present application is only an exemplary illustration, and does not form a structural limitation on the terminal device. In other embodiments of the present application, the terminal device may also adopt different interface connection manners or a combination of multiple interface connection manners in the foregoing embodiments.
The charging management module 340 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 340 may receive charging input from a wired charger via the USB interface 330. In some wireless charging embodiments, the charging management module 340 may receive a wireless charging input through a wireless charging coil of the terminal device. The charging management module 340 may also supply power to the electronic device through the power management module 341 while charging the battery 342.
The power management module 341 is configured to connect the battery 342, the charging management module 340 and the processor 310. The power management module 341 receives input from the battery 342 and/or the charge management module 340 and provides power to the processor 310, the internal memory 321, the external memory, the display 394, the camera 393, and the wireless communication module 360. The power management module 341 may also be configured to monitor parameters such as battery capacity, battery cycle count, and battery state of health (leakage, impedance). In other embodiments, the power management module 341 may also be disposed in the processor 310. In other embodiments, the power management module 341 and the charging management module 340 may be disposed in the same device.
The wireless communication function of the terminal device can be realized by the antenna 1, the antenna 2, the mobile communication module 350, the wireless communication module 360, the modem processor, the baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in a terminal device may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 350 may provide a solution including 2G/3G/4G/5G wireless communication applied on the terminal device. The mobile communication module 350 may include at least one filter, a switch, a power amplifier, a Low Noise Amplifier (LNA), and the like. The mobile communication module 350 may receive the electromagnetic wave from the antenna 1, filter, amplify, etc. the received electromagnetic wave, and transmit the filtered electromagnetic wave to the modem processor for demodulation. The mobile communication module 350 may also amplify the signal modulated by the modem processor, and convert the signal into electromagnetic wave through the antenna 1 to radiate the electromagnetic wave. In some embodiments, at least some of the functional modules of the mobile communication module 350 may be disposed in the processor 310. In some embodiments, at least some of the functional modules of the mobile communication module 350 may be disposed in the same device as at least some of the modules of the processor 310.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating a low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then passes the demodulated low frequency baseband signal to a baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs sound signals through an audio device (not limited to the speaker 370A, the receiver 370B, etc.) or displays images or video through the display 394. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be separate from the processor 310, and may be disposed in the same device as the mobile communication module 350 or other functional modules.
The wireless communication module 360 may provide solutions for wireless communication applied to the terminal device, including Wireless Local Area Networks (WLANs) (e.g., wireless fidelity (Wi-Fi) networks), Bluetooth (BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like. The wireless communication module 360 may be one or more devices integrating at least one communication processing module. The wireless communication module 360 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering processing on electromagnetic wave signals, and transmits the processed signals to the processor 310. The wireless communication module 360 may also receive a signal to be transmitted from the processor 310, frequency-modulate and amplify the signal, and convert the signal into electromagnetic waves via the antenna 2 to radiate the electromagnetic waves.
In some embodiments, the terminal device's antenna 1 is coupled to the mobile communication module 350 and antenna 2 is coupled to the wireless communication module 360, so that the terminal device can communicate with the network and other devices via wireless communication technologies. The wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, etc. The GNSS may include a global positioning system (GPS), a global navigation satellite system (GLONASS), a BeiDou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS), and/or a satellite based augmentation system (SBAS).
The terminal device implements the display function through the GPU, the display screen 394, and the application processor, etc. The GPU is an image processing microprocessor coupled to a display 394 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 310 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 394 is used to display images, video, and the like. The display screen 394 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the terminal device may include 1 or N display screens 394, where N is a positive integer greater than 1.
The terminal device may implement the shooting function through the ISP, the camera 393, the video codec, the GPU, the display 394, the application processor, and the like.
The ISP is used to process the data fed back by the camera 393. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be located in camera 393.
Camera 393 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing element converts the optical signal into an electrical signal, which is then passed to the ISP where it is converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into image signal in standard RGB, YUV and other formats. In some embodiments, the terminal device may include 1 or N cameras 393, N being a positive integer greater than 1.
The digital signal processor is used for processing digital signals, and can process digital image signals and other digital signals. For example, when the terminal device selects the frequency point, the digital signal processor is used for performing fourier transform and the like on the frequency point energy.
Video codecs are used to compress or decompress digital video. The terminal device may support one or more video codecs. In this way, the terminal device can play or record videos in a plurality of coding formats, such as: moving picture experts group (MPEG)-1, MPEG-2, MPEG-3, MPEG-4, and the like.
The NPU is a neural-network (NN) computing processor that processes input information quickly by using a biological neural network structure, for example, by using a transfer mode between neurons of a human brain, and can also learn by itself continuously. The NPU can realize the intelligent cognition and other applications of the terminal equipment, such as: image recognition, face recognition, speech recognition, text understanding, and the like.
The external memory interface 320 may be used to connect an external memory card, such as a Micro SD card, to extend the storage capability of the terminal device. The external memory card communicates with the processor 310 through the external memory interface 320 to implement a data storage function. For example, files such as music, video, etc. are saved in an external memory card.
The internal memory 321 may be used to store computer-executable program code, which includes instructions. The processor 310 executes various functional applications of the terminal device and data processing by executing instructions stored in the internal memory 321. The internal memory 321 may include a program storage area and a data storage area. The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required by at least one function, and the like. The storage data area may store data (such as audio data, a phonebook, etc.) created during use of the terminal device, and the like. In addition, the internal memory 321 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like.
The terminal device may implement an audio function through the audio module 370, the speaker 370A, the receiver 370B, the microphone 370C, the earphone interface 370D, and the application processor. Such as music playing, recording, etc.
The audio module 370 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 370 may also be used to encode and decode audio signals. In some embodiments, the audio module 370 may be disposed in the processor 310, or some functional modules of the audio module 370 may be disposed in the processor 310.
The speaker 370A, also called a "horn", is used to convert the audio electrical signal into an acoustic signal. The terminal device can listen to music through the speaker 370A or listen to a handsfree call.
The receiver 370B, also called "earpiece", is used to convert the electrical audio signal into an acoustic signal. When the terminal device answers a call or voice information, it is possible to answer a voice by placing the receiver 370B close to the human ear.
Microphone 370C, also known as a "microphone," is used to convert sound signals into electrical signals. When making a call or transmitting voice information, the user can input a voice signal into the microphone 370C by speaking the user's mouth near the microphone 370C. The terminal device may be provided with at least one microphone 370C. In other embodiments, the terminal device may be provided with two microphones 370C, which may implement a noise reduction function in addition to collecting sound signals. In other embodiments, the terminal device may further include three, four, or more microphones 370C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
The headphone interface 370D is used to connect wired headphones. The headphone interface 370D may be the USB interface 330, or may be a 3.5 mm open mobile terminal platform (OMTP) standard interface or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
The pressure sensor 380A is used for sensing a pressure signal, and converting the pressure signal into an electrical signal. In some embodiments, the pressure sensor 380A may be disposed on the display screen 394. The pressure sensor 380A can be of a wide variety, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, or the like. The capacitive pressure sensor may be a sensor comprising at least two parallel plates having an electrically conductive material. When a force acts on the pressure sensor 380A, the capacitance between the electrodes changes. The terminal device determines the intensity of the pressure from the change in capacitance. When a touch operation is applied to the display screen 394, the terminal device detects the intensity of the touch operation according to the pressure sensor 380A. The terminal device can also calculate the touched position from the detection signal of the pressure sensor 380A. In some embodiments, the touch operations that are applied to the same touch position but different touch operation intensities may correspond to different operation instructions. For example: and when the touch operation with the touch operation intensity smaller than the first pressure threshold value acts on the short message application icon, executing an instruction for viewing the short message. And when the touch operation with the touch operation intensity larger than or equal to the first pressure threshold value acts on the short message application icon, executing an instruction of newly building the short message.
The gyro sensor 380B may be used to determine the motion attitude of the terminal device. In some embodiments, the angular velocity of the terminal device about three axes (i.e., the x, y, and z axes) may be determined by gyroscope sensor 380B. The gyro sensor 380B may be used for photographing anti-shake. Illustratively, when the shutter is pressed, the gyroscope sensor 380B detects the shake angle of the terminal device, calculates the distance to be compensated for by the lens module according to the shake angle, and enables the lens to counteract the shake of the terminal device through reverse movement, thereby achieving anti-shake. The gyro sensor 380B may also be used for navigation, somatosensory gaming scenes.
The air pressure sensor 380C is used to measure air pressure. In some embodiments, the terminal device calculates altitude, aiding positioning and navigation, from the barometric pressure values measured by barometric pressure sensor 380C.
The magnetic sensor 380D includes a Hall sensor. The terminal device may detect the opening and closing of a flip holster using the magnetic sensor 380D. In some embodiments, when the terminal device is a flip phone, it may detect the opening and closing of the flip according to the magnetic sensor 380D. Features such as automatic unlocking upon flipping open are then set according to the detected opening or closing state of the holster or flip cover.
The acceleration sensor 380E can detect the magnitude of the terminal device acceleration in various directions (typically three axes). When the terminal equipment is static, the size and the direction of gravity can be detected. The method can also be used for recognizing the posture of the electronic equipment, and is applied to horizontal and vertical screen switching, pedometers and other applications.
A distance sensor 380F is used for measuring distance. The terminal device may measure distance by infrared or laser. In some embodiments, in a shooting scenario, the terminal device may use the distance sensor 380F to measure distance to achieve fast focusing.
The proximity light sensor 380G may include, for example, a light-emitting diode (LED) and a light detector, such as a photodiode. The light-emitting diode may be an infrared light-emitting diode. The terminal device emits infrared light outward through the light-emitting diode and uses the photodiode to detect infrared light reflected from a nearby object. When sufficient reflected light is detected, it can be determined that there is an object near the terminal device; when insufficient reflected light is detected, the terminal device may determine that there is no object nearby. The terminal device can use the proximity light sensor 380G to detect that the user is holding the terminal device close to the ear during a call, so that the screen is automatically turned off to save power. The proximity light sensor 380G may also be used in holster mode and pocket mode to automatically unlock and lock the screen.
The ambient light sensor 380L is used to sense the ambient light level. The terminal device may adaptively adjust the brightness of the display 394 based on the perceived ambient light level. The ambient light sensor 380L may also be used to automatically adjust the white balance when taking a picture. The ambient light sensor 380L may also cooperate with the proximity light sensor 380G to detect whether the terminal device is in a pocket to prevent accidental touches.
The fingerprint sensor 380H is used to capture a fingerprint. The terminal equipment can utilize the collected fingerprint characteristics to realize fingerprint unlocking, access to an application lock, fingerprint photographing, fingerprint incoming call answering and the like.
The temperature sensor 380J is used to detect temperature. In some embodiments, the terminal device executes a temperature processing strategy using the temperature detected by the temperature sensor 380J. For example, when the temperature reported by the temperature sensor 380J exceeds the threshold, the terminal device performs a reduction in the performance of the processor located near the temperature sensor 380J, so as to reduce power consumption and implement thermal protection. In other embodiments, the terminal device heats the battery 342 when the temperature is below another threshold to avoid abnormal shutdown of the terminal device due to low temperature. In other embodiments, the terminal device performs boosting of the output voltage of the battery 342 when the temperature is below a further threshold value to avoid abnormal shutdown due to low temperature.
The touch sensor 380K is also referred to as a "touch panel". The touch sensor 380K may be disposed on the display screen 394, and the touch sensor 380K and the display screen 394 form a touch screen, which is also referred to as a "touch screen". The touch sensor 380K is used to detect a touch operation applied thereto or thereabout. The touch sensor can communicate the detected touch operation to the application processor to determine the touch event type. Visual output associated with the touch operation may be provided via the display 394. In other embodiments, the touch sensor 380K may be disposed on the surface of the terminal device at a different location than the display 394.
The bone conduction sensor 380M can acquire a vibration signal. In some embodiments, the bone conduction sensor 380M can acquire a vibration signal of the bone mass that vibrates when a person speaks. The bone conduction sensor 380M may also contact the human pulse to receive a blood pressure beating signal. In some embodiments, the bone conduction sensor 380M may also be disposed in a headset to form a bone conduction headset. The audio module 370 may parse out a voice signal based on the vibration signal of the vocal-part bone mass acquired by the bone conduction sensor 380M, so as to implement a voice function. The application processor may parse heart rate information based on the blood pressure beating signal acquired by the bone conduction sensor 380M, so as to implement a heart rate detection function.
Keys 390 include a power-on key, a volume key, and the like. The keys 390 may be mechanical keys or touch keys. The terminal device may receive key inputs and generate key signal inputs related to user settings and function control of the terminal device.
The motor 391 may generate a vibration cue. The motor 391 may be used for both incoming call vibration prompting and touch vibration feedback. For example, touch operations applied to different applications (e.g., photographing, audio playing, etc.) may correspond to different vibration feedback effects, and touch operations on different areas of the display 394 may also correspond to different vibration feedback effects. Different application scenarios (such as time reminding, receiving information, alarm clock, game, and the like) may also correspond to different vibration feedback effects. The touch vibration feedback effect may also support customization.
Indicator 392 may be an indicator light that may be used to indicate a state of charge, a change in charge, or a message, missed call, notification, etc.
The SIM card interface 395 is for connecting a SIM card. The SIM card can be attached to and detached from the terminal device by being inserted into or pulled out of the SIM card interface 395. The terminal device can support 1 or N SIM card interfaces, where N is a positive integer greater than 1. The SIM card interface 395 may support a Nano SIM card, a Micro SIM card, a SIM card, etc. Multiple cards can be inserted into the same SIM card interface 395 at the same time. The types of the multiple cards may be the same or different. The SIM card interface 395 may also be compatible with different types of SIM cards, as well as with an external memory card. The terminal device interacts with the network through the SIM card to realize functions such as calls and data communication. In some embodiments, the terminal device employs an eSIM, namely an embedded SIM card. The eSIM card can be embedded in the terminal device and cannot be separated from it.
Fig. 4 is a schematic flow chart of a speech synthesis method provided in an embodiment of the present application, which may be applied to the terminal device described above by way of example and not limitation, and referring to fig. 4, the method includes:
step 401, obtaining an identity code.
Wherein the identity code is used to indicate the identity of the third party performing the voice conversion.
The terminal device can acquire an identity code representing the identity of a third party in the process of generating the voice data, and adjust the phoneme duration of at least one phoneme in the text to be converted according to the identity code, so that the phoneme duration of at least one phoneme in the generated voice data is adjusted based on the identity code, and the identity of the third party can subsequently be determined from the adjusted phoneme durations.
In a possible implementation manner, the terminal device may search a configuration file of the terminal device or the application program in a preset storage space, and then search an identity code for identifying the identity of the third party from the configuration file, so that in a subsequent step, the phoneme duration of at least one phoneme in the speech data may be adjusted according to the identity code.
Further, the terminal device may store a digital signature of a third party, that is, an identity signature of the third party, so that the terminal device may obtain the identity signature of the third party based on preset configuration information in the process of obtaining the identity code, and then search for the identity code corresponding to the identity signature from the corresponding relationship according to the identity signature and the preset corresponding relationship between the identity signature and the identity code.
Specifically, the terminal device may extract configuration information representing the identity of the third party from the configuration file, obtain the identity signature of the third party based on that configuration information, search the correspondence between identity signatures and identity codes for an entry whose signature matches the obtained one, and finally use the identity code of the matching entry as the identity code of the third party.
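As an illustration only, the correspondence lookup described above can be modeled as a simple table. The following minimal sketch represents the preset correspondence between identity signatures and identity codes as a dictionary; the table contents and the helper name are assumptions, not taken from this application.

```python
# A minimal sketch of the signature-to-code lookup; the preset correspondence
# between identity signatures and identity codes is modeled as a dictionary.
# Table contents and names are illustrative assumptions.
SIGNATURE_TO_CODE = {
    "signature_of_company_A": "0121",
    "signature_of_company_B": "2301",
}

def lookup_identity_code(identity_signature: str) -> str:
    """Return the identity code registered for the given identity signature."""
    return SIGNATURE_TO_CODE[identity_signature]

print(lookup_identity_code("signature_of_company_A"))  # 0121
```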
In addition, the identity code of the third party can be composed of N digits, and the parameter of each digit can take one of M candidate parameter values, so that M^N third parties can be encoded. For example, if N is 4 and M is 3, 3^4 = 81 third parties may be encoded, and if the 3 candidate parameter values are 0, 1, and 2, respectively, the 4-bit identity code may take values such as 0121, 2301, or 1132.
It should be noted that the number of bits N of the identity code of the third party and the number of parameter values M of the parameter of each bit of number may be adjusted according to the number of the third party, which is not limited in this embodiment of the present application.
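To make the capacity of this scheme concrete, the short sketch below (variable names are illustrative) enumerates the identity-code space for N = 4 digits with M = 3 candidate values per digit.

```python
# A sketch of the identity-code space: N digit positions, each taking one of
# M candidate parameter values, so M**N third parties can be encoded.
from itertools import product

N = 4  # number of digits in the identity code
M = 3  # candidate parameter values per digit, here 0, 1, and 2

print(M ** N)  # 81 encodable third parties for N = 4, M = 3

# Enumerate all possible identity codes; 0121, 2301, and 1132 are among them.
codes = ["".join(map(str, digits)) for digits in product(range(M), repeat=N)]
print(len(codes), codes[:4])  # 81 ['0000', '0001', '0002', '0010']
```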
Step 402, determining phoneme durations of phonemes corresponding to the text to be converted.
The text to be converted can be text information pre-stored in the terminal device, or can be text information generated by the terminal device according to an operation triggered by a user. For example, after detecting the question voice sent by the user, the terminal device may generate a text to be converted for answering according to the question voice of the user, convert the text to be converted, and generate and play voice data corresponding to the text to be converted.
Furthermore, the phoneme duration of at least one phoneme is determined from the identity code. The phoneme duration indicates the pronunciation duration of the phoneme when the voice data is generated. The phoneme duration may be the initial duration of the phoneme, or the sum of the initial duration and a duration increment, where the initial duration is the duration of each phoneme output by a pre-trained duration model after the text to be converted is input into it, and the duration increment is a duration calculated from the identity code.
For example, the text to be converted may correspond to 10 phonemes, and the terminal device may select a preset number of them as target phonemes according to a preset selection manner; the phoneme duration of each target phoneme may be the sum of its initial duration and the duration increment, and the phoneme durations of the remaining phonemes may be their initial durations.
After the terminal device acquires the identity code, it can input the text to be converted into a preset duration model to obtain the initial duration of each phoneme, select at least one target phoneme from the phonemes corresponding to the text to be converted, determine the duration increment of each target phoneme according to the identity code, and determine the phoneme duration of each target phoneme from its initial duration and duration increment. Meanwhile, for each phoneme other than the target phonemes among the phonemes corresponding to the text to be converted, the initial duration of that phoneme may be used as its phoneme duration.
In a possible implementation manner, the terminal device may input the text to be converted into the duration model, determine an initial consonant and a final corresponding to each character in the text to be converted through the duration model, and determine a duration corresponding to each initial consonant and each final according to a neural network in the duration model, that is, determine an initial duration of each phoneme.
Then, the terminal device may calculate the duration increment according to the identity code, select at least one target phoneme from the phonemes corresponding to the text to be converted, and add the calculated duration increment to the initial duration of each target phoneme, using the resulting sum as the phoneme duration of that target phoneme.
However, for the phonemes other than the target phoneme among the plurality of phonemes, the initial duration of the phoneme may be taken as the phoneme duration of the phoneme.
For example, the initial duration of the n-th phoneme among the plurality of phonemes may be t_n and the calculated duration increment Δt; if the n-th phoneme is selected as the target phoneme, the phoneme duration corresponding to the n-th phoneme may be T_n = t_n + Δt.
It should be noted that, in the process of calculating the duration increment according to the identity code, the terminal device may determine the duration increment of the target phoneme according to the identity code and a preset increment factor. Optionally, the terminal device may multiply the identity code by the increment factor, and use the calculated product as the duration increment.
For example, if the identity code is 0121 and the increment factor is 5, the duration increment is Δt = 0121 × 5 = 605, where the duration unit may be milliseconds.
In addition, the duration increment of each target phoneme may be the same.
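Putting the above together, a minimal sketch of this single-increment variant might look as follows; the example durations, variable names, and the random selection of two target phonemes are assumptions for illustration, not the application's prescribed implementation.

```python
# A sketch of the single-increment variant: Δt is computed from the identity
# code and a preset increment factor, and the same Δt is added to every
# randomly selected target phoneme.
import random

initial_durations_ms = [90, 210, 80, 250, 80, 220, 70, 240]  # t_n from the duration model
identity_code = 121       # "0121" interpreted as an integer
increment_factor = 5      # preset increment factor
delta_t = identity_code * increment_factor  # Δt = 605 ms

# Randomly select a preset number (here 2) of target phonemes.
target_indices = set(random.sample(range(len(initial_durations_ms)), k=2))

phoneme_durations_ms = [
    t + delta_t if n in target_indices else t  # T_n = t_n + Δt for target phonemes
    for n, t in enumerate(initial_durations_ms)
]
```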
However, in practical applications, the phoneme durations may also be adjusted in other ways according to the identity code, which is not limited in this embodiment of the present application. Optionally, a number of phonemes equal to the number of digits of the identity code is selected from the plurality of phonemes as target phonemes, and the duration increment of each target phoneme is calculated from the parameter of the corresponding digit of the identity code, following the order of the target phonemes; the phoneme duration of each target phoneme is then determined from its duration increment and its initial duration.
For example, suppose the plurality of phonemes are "n", "i", "h", "ao", "h", "ua", "w", and "ei", and the identity code is 2131. Since the identity code contains 4 digits, 4 phonemes may be selected from the plurality of phonemes as target phonemes in a preset selection manner. If the selected target phonemes are "n", "ao", "h", and "w", the duration increment of each target phoneme may be calculated from the parameter of the corresponding digit of the identity code, following the order of the target phonemes: the increment for "n" is calculated from the parameter "2" of the first digit, the increment for "ao" from the parameter "1" of the second digit, the increment for "h" from the parameter "3" of the third digit, and the increment for "w" from the parameter "1" of the fourth digit.
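The per-digit variant of the example above can be sketched in a few lines; the initial durations, the selection of target indices, and the millisecond scaling are illustrative assumptions.

```python
# A sketch of the per-digit variant: one target phoneme per identity-code
# digit, each digit scaled by the increment factor to give that phoneme its
# own duration increment.
phonemes = ["n", "i", "h", "ao", "h", "ua", "w", "ei"]
initial_ms = [100, 180, 80, 240, 80, 220, 70, 200]
identity_code = "2131"
increment_factor = 5           # assumed scale: 5 ms per code unit
target_indices = [0, 3, 4, 6]  # "n", "ao", "h", "w", chosen by a preset rule

durations_ms = list(initial_ms)
for digit, idx in zip(identity_code, target_indices):
    durations_ms[idx] += int(digit) * increment_factor  # digit-specific increment

print(durations_ms)  # [110, 180, 80, 245, 95, 220, 75, 200]
```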
Step 403, converting the text to be converted according to the phoneme duration of each phoneme to obtain the voice data.
After the phoneme durations are determined, the terminal device can convert each character in the text to be converted into speech, thereby obtaining the voice data corresponding to the text to be converted. In the process of determining the identity of the third party that synthesized the voice data, the determination can then be made from the phoneme duration of at least one target phoneme in the voice data.
In a possible implementation manner, after determining the phoneme durations, the terminal device may use a parametric method: the phoneme durations of the phonemes are input into an acoustic model, which determines parameters such as the fundamental frequency and sends them to a vocoder, and the vocoder then generates the voice data corresponding to the text to be converted.
Of course, the terminal device may also generate the voice data based on the determined phoneme duration by using a splicing method or an end-to-end method, and the method used for generating the voice data is not limited in the embodiment of the present application.
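The data flow of the parametric method can be summarized in a few lines. The sketch below uses hypothetical callables as stand-ins for the acoustic model and the vocoder; it illustrates the pipeline only and is not a real API.

```python
# A high-level sketch of the parametric pipeline: phonemes and their
# (possibly adjusted) durations -> acoustic parameters -> waveform.
def synthesize(phonemes, phoneme_durations, acoustic_model, vocoder):
    # The acoustic model maps phonemes and durations to acoustic parameters
    # such as the fundamental frequency.
    acoustic_params = acoustic_model(phonemes, phoneme_durations)
    # The vocoder renders the acoustic parameters into the output waveform.
    return vocoder(acoustic_params)
```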
In summary, in the speech synthesis method provided in the embodiment of the present application, an identity code indicating the identity of the third party performing the speech conversion is obtained, the phoneme durations of the phonemes corresponding to the text to be converted are determined, with the phoneme duration of at least one phoneme determined according to the identity code, and the text to be converted is then converted according to the phoneme durations to obtain the voice data. The voice data can thus carry a digital signature indicating the identity of the third party, namely the phoneme durations determined by the identity code; the identity code can be recovered from the phoneme durations and the identity of the third-party company determined from it. This solves the problem that the identity of the third party synthesizing the voice data cannot be determined, allows the identity of the third party to be determined from the voice data even after secondary transcription, and thereby improves the stability of the digital signature.
The above embodiment describes the process in which the terminal device generates voice data based on the identity code of a third party. The following embodiment describes the process in which the terminal device determines the identity of the third party based on the abnormal voice data generated as above, taking as an example abnormal voice data whose third-party identity needs to be confirmed together with standard voice data matching that abnormal voice data.
Fig. 5 is a schematic flowchart of an identity recognition method provided in an embodiment of the present application, and by way of example and not limitation, the method may be applied to the terminal device described above, and referring to fig. 5, the method includes:
Step 501, extracting each phoneme in the standard voice data and the abnormal voice data respectively.
The standard voice data is the voice data generated by the terminal device from the text to be converted without using the identity code, and the abnormal voice data is the voice data generated by the terminal device from the text to be converted using the identity code. That is, the phoneme duration of at least one phoneme in the abnormal speech data differs from that of the corresponding phoneme in the standard speech data.
When the identity of the third party synthesizing the abnormal speech data is determined, the phonemes of the standard speech data and the phonemes of the abnormal speech data may be extracted respectively, and the phoneme duration of each phoneme may be determined, so that in the subsequent step, the identity signature of the third party may be determined according to each phoneme duration.
It should be noted that, in practical applications, the terminal device may acquire the standard speech data and the abnormal speech data before extracting each phoneme. For example, the terminal device may receive the abnormal voice data sent by the server, obtain a text to be converted according to the abnormal voice data, and convert the text to be converted to obtain the standard voice data. Or, the terminal device may also receive abnormal voice data sent by the server and standard voice data sent to the terminal device based on the abnormal voice data, and the embodiment of the present application does not limit the manner in which the terminal device obtains the standard voice data and the abnormal voice data.
In addition, before extracting phonemes from the abnormal voice data, the terminal device may check the integrity of the abnormal voice data to avoid processing illegal voice obtained by editing and splicing. That is, the terminal device may first detect whether the abnormal voice data is complete voice data, and extract the phonemes of the standard voice data and the abnormal voice data only in the case that the abnormal voice data is complete.
For example, the terminal device may evaluate packet loss, interruption, and word-swallowing phenomena occurring in the abnormal voice data based on media-plane data, and determine from the evaluation result whether the abnormal voice data is complete voice data. If the abnormal voice data is complete, the phonemes of the standard voice data and the abnormal voice data are extracted; otherwise, they are not extracted.
Step 502, for each phoneme in the abnormal voice data, comparing the phoneme with a phoneme matched with the phoneme in the standard voice data to obtain a duration difference value corresponding to the phoneme.
After extracting each phoneme, the terminal device may compare each phoneme in the abnormal speech data with a corresponding phoneme in the standard speech data to obtain a duration difference between durations of the phonemes, so that the terminal device may determine the identity signature according to at least one duration difference.
In a possible implementation manner, for each phoneme in the abnormal speech data, the terminal device may first determine the character corresponding to the phoneme in the text to be converted and the position of that character in the text, and then search for the corresponding phoneme in the standard speech data according to the determined character and its position.
Then, the terminal device may determine the phoneme duration of the current phoneme in the abnormal speech data and that of the matched phoneme in the standard speech data, and take the difference between the two as the duration difference. After every phoneme in the abnormal speech data has been compared with its counterpart in the standard speech data, at least one duration difference is obtained.
The absolute value of the calculated duration difference may be 0 or may be greater than 0, and the identity signature of the third party may be determined according to each duration difference greater than 0.
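As an illustration of step 502, the sketch below assumes the phonemes of both utterances have already been extracted and aligned in text order; alignment by position and the function name are simplifying assumptions.

```python
# A sketch of step 502: align matched phonemes and keep the non-zero
# per-phoneme duration differences.
def duration_differences(abnormal, standard):
    """abnormal/standard: lists of (phoneme, duration_ms) pairs in text order."""
    return [
        (p_a, d_a - d_s)
        for (p_a, d_a), (p_s, d_s) in zip(abnormal, standard)
        if p_a == p_s and d_a != d_s
    ]

# Example: only "ao" was lengthened, by 615 ms.
print(duration_differences([("n", 100), ("ao", 855)],
                           [("n", 100), ("ao", 240)]))  # [('ao', 615)]
```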
It should be noted that, in practical applications, the target phonemes may be determined according to a preset selection manner; accordingly, in the process of determining the third-party identity, the target phonemes in the abnormal speech data may be determined using the same preset selection manner, the matched phonemes in the standard speech data obtained, and the phoneme durations of the two compared to obtain the duration differences.
And 503, determining the identity signature according to each time length difference value.
After calculating the duration differences, the terminal device can perform a further calculation based on the parameter corresponding to each duration difference to obtain the identity code of the third party, and can then look up the corresponding identity signature according to that identity code, thereby determining the identity of the third party.
Optionally, the terminal device may determine the identity code corresponding to the abnormal voice data according to each duration difference and the preset increment factor, and then determine the identity signature according to the identity code corresponding to the abnormal voice data and the preset correspondence between identity signatures and identity codes.
In a possible implementation manner, the terminal device may obtain the preset increment factor, invert the formula used to calculate the duration increment using the duration difference and the increment factor to obtain the identity code of the third party, and then search the preset correspondence between identity signatures and identity codes for an identity code consistent with the calculated one, thereby determining the corresponding identity signature, that is, the identity signature of the third party that synthesized the abnormal voice data.
For example, if the calculated duration difference is 615 ms and the increment factor is 5, the calculated identity code is 615 / 5 = 123; the identity signature of the third party can then be found from the correspondence between identity codes and identity signatures according to the identity code 123, completing the confirmation of the third party.
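Inverting the encoding formula recovers the identity code from a duration difference; the sketch below rounds the quotient to absorb small jitter that secondary transcription may introduce (the function name is an illustrative assumption).

```python
# A sketch of recovering the identity code: since Δt = code × factor,
# code = Δt / factor.
increment_factor = 5  # preset increment factor, ms per code unit

def recover_identity_code(duration_diff_ms: float) -> int:
    return round(duration_diff_ms / increment_factor)

print(recover_identity_code(615))  # 123; then look up the identity signature
```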
In summary, in the identity recognition method provided in the embodiment of the present application, the phonemes in the standard speech data and in the abnormal speech data are extracted respectively; each phoneme in the abnormal speech data is compared with its matched phoneme in the standard speech data to obtain at least one duration difference, and the identity signature is then determined from the at least one duration difference. The identity of the third party can thus be determined from the identity signature, which avoids the situation in which the identity of the third party synthesizing the speech data cannot be determined, allows the third-party identity to be determined from the phoneme duration of a target phoneme in the speech data even after secondary transcription, and thereby improves the stability of the digital signature.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 6 is a block diagram of a speech synthesis apparatus according to an embodiment of the present application, which corresponds to the speech synthesis method described in the foregoing embodiment, and only shows portions related to the embodiment of the present application for convenience of description.
Referring to fig. 6, the apparatus includes:
a first obtaining module 601, configured to obtain an identity code, where the identity code is used to indicate an identity of a third party performing voice conversion;
a first determining module 602, configured to determine a phoneme duration of each phoneme corresponding to the text to be converted, where the phoneme duration of at least one phoneme is determined according to the identity code;
the converting module 603 is configured to convert the text to be converted according to the phoneme duration of each phoneme to obtain voice data.
Optionally, the first obtaining module 601 is specifically configured to obtain an identity signature of the third party based on preset configuration information; and searching the identity code corresponding to the identity signature from the preset corresponding relation between the identity signature and the identity code.
Optionally, the first determining module 602 is specifically configured to input the text to be converted into a preset duration model, so as to obtain an initial duration of each phoneme; selecting at least one target phoneme from the phonemes corresponding to the text to be converted; determining the duration increment of each target phoneme according to the identity code; determining the phoneme duration of each target phoneme according to the initial duration and the duration increment of each target phoneme; and determining the initial duration of each phoneme except the target phoneme in each phoneme corresponding to the text to be converted as the phoneme duration of the phoneme.
Optionally, the duration increment of each target phoneme is the same, and the target phoneme is randomly selected.
Optionally, the first determining module 602 is further specifically configured to determine a duration increment of each target phoneme according to the identity code and a preset increment factor.
Optionally, referring to fig. 7, the apparatus further includes:
a second obtaining module 604, configured to obtain standard voice data and abnormal voice data;
an extracting module 605, configured to extract each phoneme in the standard speech data and the abnormal speech data respectively;
a comparing module 606, configured to compare, for each phoneme in the abnormal speech data, the phoneme with the phoneme matched with it in the standard speech data, so as to obtain the duration difference corresponding to the phoneme;
a second determining module 607, configured to determine the identity signature according to each of the time length difference values.
Optionally, the second determining module 607 is specifically configured to determine, according to each of the duration difference values and a preset increment factor, an identity code corresponding to the abnormal voice data; and determining the identity signature according to the identity code corresponding to the abnormal voice data and the preset corresponding relation between the identity signature and the identity code.
Optionally, the extracting module 605 is specifically configured to, in a case that the abnormal speech data is complete speech data, respectively extract each phoneme in the standard speech data and the abnormal speech data.
In summary, the speech synthesis apparatus provided in the embodiment of the present application obtains an identity code indicating the identity of the third party performing the speech conversion, determines the phoneme durations of the phonemes corresponding to the text to be converted, with the phoneme duration of at least one phoneme determined according to the identity code, and then converts the text to be converted according to the phoneme durations to obtain the voice data. The voice data can thus carry a digital signature indicating the identity of the third party, namely the phoneme durations determined by the identity code; the identity code can be recovered from the phoneme durations and the identity of the third-party company determined from it. This solves the problem that the identity of the third party synthesizing the voice data cannot be determined, allows the identity of the third party to be determined from the voice data even after secondary transcription, and thereby improves the stability of the digital signature.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to a terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random-access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, computer-readable media may not include electrical carrier signals or telecommunications signals, in accordance with legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (11)

1. A method of speech synthesis, comprising:
acquiring an identity code, wherein the identity code is used for indicating the identity of a third party performing voice conversion;
determining the phoneme duration of each phoneme corresponding to the text to be converted, wherein the phoneme duration of at least one phoneme is determined according to the identity code;
and converting the text to be converted according to the phoneme duration of each phoneme to obtain voice data.
2. The speech synthesis method of claim 1, wherein the obtaining the identity code comprises:
acquiring an identity signature of the third party based on preset configuration information;
and searching the identity code corresponding to the identity signature from the preset corresponding relation between the identity signature and the identity code.
3. The speech synthesis method of claim 1, wherein the determining the phoneme durations for the phonemes corresponding to the text to be converted comprises:
inputting the text to be converted into a preset duration model to obtain the initial duration of each phoneme;
selecting at least one target phoneme from the phonemes corresponding to the text to be converted;
determining a duration increment of each target phoneme according to the identity code;
determining the phoneme duration of each target phoneme according to the initial duration and the duration increment of each target phoneme;
and for each phoneme except the target phoneme in each phoneme corresponding to the text to be converted, determining the initial duration of the phoneme as the phoneme duration of the phoneme.
4. The speech synthesis method of claim 3, wherein the duration increment of each of the target phonemes is the same.
5. The speech synthesis method of claim 3, wherein the determining the duration increment of each target phoneme according to the identity code comprises:
and determining the duration increment of each target phoneme according to the identity code and a preset increment factor.
6. The speech synthesis method of claim 1, wherein the method further comprises:
acquiring standard voice data and abnormal voice data;
respectively extracting each phoneme in the standard voice data and the abnormal voice data;
for each phoneme in the abnormal voice data, comparing the phoneme with a phoneme matched with the phoneme in the standard voice data to obtain a duration difference value corresponding to the phoneme;
and determining an identity signature according to each duration difference value.
7. The speech synthesis method of claim 6, wherein the determining the identity signature according to each duration difference value comprises:
determining an identity code corresponding to the abnormal voice data according to each duration difference value and a preset increment factor;
and determining the identity signature according to the identity code corresponding to the abnormal voice data and the preset corresponding relation between the identity signature and the identity code.
8. The speech synthesis method according to claim 6 or 7, wherein the extracting of each phoneme in the standard speech data and the abnormal speech data, respectively, comprises:
and under the condition that the abnormal voice data is complete voice data, extracting each phoneme in the standard voice data and the abnormal voice data respectively.
9. A speech synthesis apparatus, comprising:
the system comprises a first acquisition module, a second acquisition module and a voice conversion module, wherein the first acquisition module is used for acquiring an identity code which is used for indicating the identity of a third party performing voice conversion;
a first determining module, configured to determine a phoneme duration of each phoneme corresponding to a text to be converted, where the phoneme duration of at least one phoneme is determined according to the identity code;
and the conversion module is used for converting the text to be converted according to the phoneme duration of each phoneme to obtain voice data.
10. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
CN202010457474.4A 2020-05-26 2020-05-26 Speech synthesis method and device Active CN113793590B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010457474.4A CN113793590B (en) 2020-05-26 2020-05-26 Speech synthesis method and device

Publications (2)

Publication Number Publication Date
CN113793590A true CN113793590A (en) 2021-12-14
CN113793590B CN113793590B (en) 2024-07-05

Family

ID=79180986

Country Status (1)

Country Link
CN (1) CN113793590B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant