CN118098199A - Personalized speech synthesis method, electronic device, server and storage medium - Google Patents

Personalized speech synthesis method, electronic device, server and storage medium

Info

Publication number
CN118098199A
Authority
CN
China
Prior art keywords
feature, sub, text information, audio, voice
Legal status
Pending
Application number
CN202410510488.6A
Other languages
Chinese (zh)
Inventor
龚雪飞
邢晓羊
何金玲
金鑫
Current Assignee
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Application filed by Honor Device Co Ltd
Priority to CN202410510488.6A
Publication of CN118098199A

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiments of the present application provide a personalized speech synthesis method, an electronic device, a server and a storage medium. The method includes: when the electronic device has enabled a speech synthesis function, acquiring text information in the electronic device on which speech synthesis needs to be performed, the text information including text information input by a user and text information generated by the electronic device; acquiring an audio feature and a text feature corresponding to the user, the audio feature being generated from voice information recorded when the user registered the speech synthesis function, and the text feature being generated from the registration text read aloud by the user during that registration; determining a stretch duration for performing speech synthesis on the text information based on the audio feature, the text feature and an encoding result corresponding to the text information; and, based on the stretch duration, performing speech synthesis on the text information and the audio feature through an audio model to output personalized speech. The method can improve the effect of speech synthesis.

Description

Personalized speech synthesis method, electronic device, server and storage medium
Technical Field
The present application relates to the field of electronic technologies, and in particular, to a personalized speech synthesis method, an electronic device, a server, and a storage medium.
Background
With the continuous development of electronic devices, their abundant functions bring great convenience to users' lives. For example, when the electronic device has enabled a voice assistant function, the user may hold a voice conversation with the electronic device: if the user inputs a sentence of speech, the electronic device may reply with speech based on it. For another example, if the electronic device has enabled an artificial intelligence (AI) call function and receives an incoming call, the user may input text on the call interface, and the electronic device sends speech synthesized from that text to the peer electronic device.
In the above scenarios, whether the electronic device replies with speech or synthesizes text into speech, a text-to-speech (TTS) process is involved. To improve the user experience, the electronic device is also provided with a personalized speech synthesis function: the user can record speech himself or herself, so that the speech subsequently output by the electronic device has the user's own timbre. For this function, how to make the synthesized speech more closely approximate the user's real prosodic characteristics is a problem to be solved.
Disclosure of Invention
The present application provides a personalized speech synthesis method, an electronic device, a server and a storage medium, which can make the prosody and other characteristics of the synthesized personalized speech closer to the user's actual characteristics, yielding a better speech synthesis effect.
In a first aspect, the present application provides a personalized speech synthesis method, the method including: when the electronic device has enabled a speech synthesis function, acquiring text information in the electronic device on which speech synthesis needs to be performed, the text information including text information input by a user and text information generated by the electronic device; acquiring an audio feature and a text feature corresponding to the user, the audio feature being generated from voice information recorded when the user registered the speech synthesis function, and the text feature being generated from the registration text information read aloud by the user during that registration; determining a stretch duration for performing speech synthesis on the text information based on the audio feature, the text feature and an encoding result corresponding to the text information; and, based on the stretch duration, performing speech synthesis on the text information and the audio feature through an audio model and outputting personalized speech, where the audio model is a text-independent model obtained by training on first text information and first speech information in training data.
When the user uses the speech synthesis function of the electronic device, the electronic device can output speech in the timbre the user wants (i.e., personalized speech). For example, in the scenario of the voice assistant function, after the user inputs a sentence, the electronic device can reply using the personalized speech; in the scenario of the AI call function, the user inputs text information, and the electronic device can convert the text information into personalized speech (such as speech in the user's own voice) to reply to the peer user.
It should be noted that the personalized speech synthesis method of the first aspect may be executed by an electronic device or by a server (e.g., a cloud server); the following description takes execution by the electronic device as an example.
In the present application, when the user registers the speech synthesis function (i.e., custom timbre registration), the user reads aloud a prompted phrase or sentence (i.e., the registration text), and the electronic device can extract the audio feature corresponding to the user from the voice information input by the user. After subsequently receiving text information, the electronic device can process the text information and the audio feature corresponding to the user based on the trained audio model to obtain personalized speech with the user's timbre. The audio model is a text-independent model obtained by training on the first text information and the first speech information in the training data; when processing the text information and the audio feature corresponding to the user, it can take the audio feature corresponding to the user as a prompt, and synthesize and output speech with the user's timbre. In order to make the prosody and other characteristics of the output speech closer to the user's actual characteristics, the duration corresponding to the speech needs to be predicted effectively. Therefore, the stretch duration of the current text information is predicted based on the audio feature and the text feature obtained when the user registered the speech synthesis function, and speech synthesis is performed on the text information and the audio feature based on the stretch duration, thereby improving the effect of speech synthesis.
As for the text information in the electronic device on which speech synthesis needs to be performed: in the scenario of the voice assistant function, the text information may be the reply information that the electronic device queries for the speech input by the user; in the scenario of the AI call function, the text information may be information input by the user that is to be replied to the peer user.
With reference to the first aspect, in some implementations of the first aspect, determining the stretch duration for performing speech synthesis on the text information based on the audio feature, the text feature and the encoding result corresponding to the text information includes: inputting the audio feature and the text feature into a first duration prediction model and outputting a first predicted duration; and inputting the first predicted duration and the encoding result corresponding to the text information into a second duration prediction model and outputting the stretch duration.
In order to improve the prediction accuracy of the stretch duration and the effect of speech synthesis, a two-stage duration prediction model is adopted: the first duration prediction model outputs a first predicted duration based on the audio feature and the text feature corresponding to the user; then the first predicted duration and the encoding result corresponding to the text information to be synthesized are input into the second duration prediction model, with the first predicted duration serving as prompt information, and the second predicted duration, namely the stretch duration, is output.
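As an illustration of the two-stage prediction described above, the following is a minimal sketch of how such a predictor might be wired together at inference time; the module structure, tensor shapes and the way the first predicted duration is fed to the second stage are assumptions for illustration, not the disclosed architecture.

```python
import torch
import torch.nn as nn

class TwoStageDurationPredictor(nn.Module):
    """Sketch of a two-stage duration predictor (shapes and layers are assumed)."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Stage 1: predicts a duration from the registration-time audio feature
        # and text feature (assumed to be time-aligned to the same length).
        self.stage1 = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, 1))
        # Stage 2: predicts the stretch duration from the encoding result of the
        # text to be synthesized, with the stage-1 output acting as a prompt.
        self.stage2 = nn.Sequential(
            nn.Linear(feat_dim + 1, feat_dim), nn.ReLU(), nn.Linear(feat_dim, 1))

    def forward(self, audio_feat, text_feat, text_encoding):
        # audio_feat, text_feat: [T_prompt, feat_dim]; text_encoding: [T_text, feat_dim]
        first_pred = self.stage1(torch.cat([audio_feat, text_feat], dim=-1))  # first predicted duration
        prompt = first_pred.mean(dim=0, keepdim=True).expand(text_encoding.size(0), 1)
        stretch = self.stage2(torch.cat([text_encoding, prompt], dim=-1))     # stretch duration per phoneme
        return stretch.squeeze(-1)
```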
With reference to the first aspect, in some implementations of the first aspect, a training manner of the first duration prediction model and the second duration prediction model includes:
Acquiring a first semantic feature corresponding to the first text information and a first acoustic feature corresponding to the first speech information;
Aligning the first semantic feature with the first acoustic feature, and determining a first stretch duration feature corresponding to the first semantic feature;
Generating a random number k, dividing the first semantic feature into a first sub-feature and a second sub-feature according to the random number k, dividing the first stretch duration feature into a third sub-feature and a fourth sub-feature, and dividing the first acoustic feature into a fifth sub-feature and a sixth sub-feature;
Inputting the first sub-feature and the fifth sub-feature into the first duration prediction model, and adjusting the parameter values of the first duration prediction model according to the loss between its output result and the third sub-feature until the first duration prediction model converges;
Inputting the second sub-feature and the third sub-feature into the second duration prediction model, and adjusting the parameter values of the second duration prediction model according to the loss between its output result and the fourth sub-feature until the second duration prediction model converges.
It will be appreciated that if the two-stage duration prediction model is to be used, the first duration prediction model and the second duration prediction model need to be trained, and the training data used in the training process may be consistent with the training data used for the audio model. First, the electronic device may perform front-end processing, dimension mapping processing (embedding), encoding processing (encoder) and length stretching processing (LR) on the first text information in the training data to obtain the first semantic feature (semantic), and perform audio encoding processing, multi-channel mapping processing (embedding) and multi-channel accumulation processing on the first speech information in the training data to obtain the first acoustic feature (acoustic). Meanwhile, when the first semantic feature is aligned with the first acoustic feature, a first stretch duration feature (duration) corresponding to the first semantic feature can be obtained. The electronic device may then generate a random number k, divide the first semantic feature into a first sub-feature prompt_se and a second sub-feature target_se, divide the first stretch duration feature into a third sub-feature prompt_dur and a fourth sub-feature target_dur, and divide the first acoustic feature into a fifth sub-feature prompt_ac and a sixth sub-feature target_ac. Next, prompt_se and prompt_ac may be input into the first duration prediction model, a first duration pmt_dur_predict with which the user pronounces the phonemes corresponding to prompt_se is predicted, a first loss between pmt_dur_predict and prompt_dur is calculated, and the model parameters of the first duration prediction model are adjusted based on the first loss until training converges. Then prompt_dur and target_se are input into the second duration prediction model, a second duration tg_dur_predict with which the user pronounces the phonemes corresponding to target_se is predicted with prompt_dur serving as prompt information, a second loss between tg_dur_predict and target_dur is calculated, and the model parameters of the second duration prediction model are adjusted based on the second loss until training converges.
In the training process of the second duration prediction model, prompt_dur and target_se are input; compared with inputting prompt_ac, the amount of information is greatly reduced, so the interference of irrelevant information can be reduced, the convergence of the duration prediction model is improved, the accuracy of the predicted speech duration is further improved, and the effect of speech synthesis is improved.
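A minimal training-step sketch of the procedure above, assuming the aligned semantic, duration and acoustic features have already been extracted; the loss function, optimizers and the simple prefix split used here (instead of the middle-segment interception described below) are placeholder assumptions.

```python
import torch.nn.functional as F

def train_step(model1, model2, opt1, opt2, semantic, duration, acoustic, k):
    """One step of the split-and-predict training scheme described above (sketch)."""
    # Split each aligned feature at index k into a prompt part and a target part
    # (a simple prefix split for brevity; the disclosure intercepts a middle segment).
    prompt_se, target_se = semantic[:k], semantic[k:]
    prompt_dur, target_dur = duration[:k], duration[k:]
    prompt_ac = acoustic[:k]

    # First duration prediction model: prompt semantic + acoustic -> prompt duration.
    pmt_dur_predict = model1(prompt_se, prompt_ac)
    loss1 = F.l1_loss(pmt_dur_predict, prompt_dur)  # assumed loss; the disclosure does not specify it
    opt1.zero_grad()
    loss1.backward()
    opt1.step()

    # Second duration prediction model: target semantic, with prompt_dur as the prompt.
    tg_dur_predict = model2(prompt_dur, target_se)
    loss2 = F.l1_loss(tg_dur_predict, target_dur)
    opt2.zero_grad()
    loss2.backward()
    opt2.step()
    return loss1.item(), loss2.item()
```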
With reference to the first aspect, in some implementations of the first aspect, the value range of the random number k lies within (0, L), where L is the data length of the first semantic feature.
With reference to the first aspect, in some implementations of the first aspect, dividing the first semantic feature into the first sub-feature and the second sub-feature according to the random number k includes: starting from a first position of the first semantic feature, taking a segment of length k as the first sub-feature, and taking the combination of the remaining parts as the second sub-feature;
dividing the first stretch duration feature into the third sub-feature and the fourth sub-feature includes: starting from a first position of the first stretch duration feature, taking a segment of length k as the third sub-feature, and taking the combination of the remaining parts as the fourth sub-feature;
dividing the first acoustic feature into the fifth sub-feature and the sixth sub-feature includes: starting from a first position of the first acoustic feature, taking a segment of length k as the fifth sub-feature, and taking the combination of the remaining parts as the sixth sub-feature.
When the random number k is used to divide each feature, it is considered that there may be a period of blank speech just before the user starts speaking. If the semantic feature and the acoustic feature were simply split into a front part and a rear part, the front part might contain some null values, and if that front part were taken as the prompt to be used subsequently as prompt information, the null values would affect the prediction accuracy. Therefore, the electronic device can intercept an intermediate segment of the semantic feature, the stretch duration feature and the acoustic feature according to the random number k, and combine the remaining parts, so as to obtain the prompt part and the target part of each feature.
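The following sketch illustrates this middle-segment interception: a segment of length k taken from some position inside the feature serves as the prompt part, and the pieces on either side are concatenated as the target part. The choice of starting position shown here is an assumption for illustration.

```python
from typing import Optional
import torch

def split_by_k(feature: torch.Tensor, k: int, start: Optional[int] = None):
    """Split a [T, D] feature into a k-length prompt segment taken from the middle
    and a target made of the concatenated remainder (leading + trailing parts)."""
    T = feature.size(0)
    if start is None:
        # Assumed heuristic: centre the k-length window to avoid the silent lead-in.
        start = max(0, (T - k) // 2)
    prompt = feature[start:start + k]
    target = torch.cat([feature[:start], feature[start + k:]], dim=0)
    return prompt, target

# The same k and start would be applied to the semantic feature, the stretch duration
# feature and the acoustic feature so that the resulting parts stay aligned.
```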
With reference to the first aspect, in some implementations of the first aspect, performing speech synthesis on the text information and the audio feature through the audio model based on the stretch duration, and outputting personalized speech, includes: performing feature extraction on the text information through the audio model to obtain a semantic feature whose length is the stretch duration corresponding to the text information; and performing speech synthesis on the semantic feature and the audio feature through the audio model, and outputting the personalized speech.
When the electronic device uses the audio model to perform speech synthesis on the text information and the audio feature corresponding to the user, feature extraction may be performed on the text information based on the stretch duration to obtain a semantic feature whose length is the stretch duration, and speech synthesis is then performed based on the semantic feature and the audio feature corresponding to the user.
In some implementations, the electronic device may perform front-end processing on the text information, such as text regularization processing, prosody prediction processing and phonetic annotation processing, to obtain phoneme information corresponding to the text information, and then perform dimension mapping processing, encoding processing and duration stretching processing on the phoneme information through the audio model to obtain a semantic feature whose length is the stretch duration.
The dimension mapping processing refers to mapping high-dimensional data (such as text, pictures and audio) into a low-dimensional space; the encoding processing refers to integrating the features of the input text information and converting them into high-level abstract feature blocks; the length stretching processing refers to stretching the text information according to its reading duration, that is, predicting the duration with which each character (or phoneme) is read aloud and adding this duration information to the text information.
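As a concrete illustration of the length stretching processing, the sketch below repeats each phoneme's encoded vector according to its predicted duration in frames, so that the resulting semantic feature has the total length given by the stretch duration; the frame-rounding rule is an assumption.

```python
import torch

def length_regulate(phoneme_encodings: torch.Tensor, durations_in_frames: torch.Tensor) -> torch.Tensor:
    """Expand [num_phonemes, D] encodings into [total_frames, D] by repeating each
    phoneme vector for its predicted number of frames (the stretch duration)."""
    repeats = durations_in_frames.float().round().clamp(min=1).long()
    return torch.repeat_interleave(phoneme_encodings, repeats, dim=0)
```

For example, with 10 phoneme vectors each predicted to last 20 frames, the output contains 200 frames.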
When speech synthesis is performed on the semantic feature and the audio feature corresponding to the user through the audio model, the audio feature corresponding to the user can be used as a timbre prompt, and the semantic feature and the audio feature are fused and decoded so as to convert the text information into personalized speech with the same timbre as the user. Therefore, no matter what the text information is, it can be converted into personalized speech as long as the audio feature extracted when the user registered is available, which improves the processing efficiency of speech synthesis.
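As a schematic illustration of this fusion-and-decoding step, the sketch below conditions a decoder on the stretched semantic features and uses the user's audio feature as the timbre prompt; the transformer decoder and mel-spectrogram output shown here are assumptions, not the disclosed model.

```python
import torch
import torch.nn as nn

class PromptedDecoder(nn.Module):
    """Decode stretched semantic features into acoustic output, conditioned on a timbre prompt."""

    def __init__(self, dim: int = 256, num_layers: int = 4):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.out = nn.Linear(dim, 80)  # e.g. mel-spectrogram frames (assumed output representation)

    def forward(self, semantic: torch.Tensor, audio_prompt: torch.Tensor) -> torch.Tensor:
        # semantic: [B, T_frames, dim] (already length-regulated); audio_prompt: [B, T_prompt, dim]
        fused = self.decoder(tgt=semantic, memory=audio_prompt)  # timbre prompt as cross-attention memory
        return self.out(fused)  # personalized acoustic features, later converted to a waveform
```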
With reference to the first aspect, in some implementations of the first aspect, the acquiring an audio feature and a text feature corresponding to a user includes: acquiring voice information input by a user when registering a voice synthesis function and registration text information read by the user when registering the voice synthesis function; performing audio coding processing, multichannel mapping processing and multichannel accumulating processing on the input voice information to obtain audio characteristics corresponding to a user; and performing front-end processing on the registered text information to obtain text characteristics.
When the user registers the voice synthesis function, the electronic equipment is provided with a corresponding registration interface. For example, in the context of a voice assistant function, a user may set a broadcast tone, such as a custom tone setting, on the intelligent voice setting interface. And displaying a prompt for reading the phrase on the custom tone setting interface, wherein the user can read the corresponding phrase (namely the registered text) so as to record the voice of the user by the electronic equipment. Alternatively, if the user does not succeed in recording, recording may be resumed. Optionally, in the application, when the user inputs voice information, the corresponding number N of short sentences read by the user is less than or equal to 2.
After the recording is completed, the electronic device can perform audio encoding processing, multichannel mapping processing and multichannel accumulating processing on the voice information recorded by the user so as to obtain the audio characteristics corresponding to the user, and perform front-end processing on the registered text information so as to obtain the text characteristics. Optionally, the electronic device may further store the audio feature corresponding to the user and the identifier of the user in a feature library, and then the subsequent electronic device may directly search the audio feature corresponding to the user from the feature library when performing speech synthesis.
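The sketch below outlines, at a high level, how the registration-time processing described above could be organized; the encoder and front-end objects and the feature-library storage are placeholders (assumptions), not the device's actual modules.

```python
from dataclasses import dataclass
import torch

@dataclass
class RegisteredVoice:
    user_id: str
    audio_feature: torch.Tensor  # from audio encoding + multichannel mapping + accumulation
    text_feature: torch.Tensor   # from front-end processing of the registration text

feature_library: dict = {}  # keyed by the user's identifier

def register_voice(user_id, recorded_speech, registration_text, audio_encoder, frontend):
    # Audio side: audio encoding -> per-channel embedding -> weighted accumulation.
    audio_feature = audio_encoder(recorded_speech)
    # Text side: front-end processing (regularization, prosody prediction, phonetic annotation).
    text_feature = frontend(registration_text)
    feature_library[user_id] = RegisteredVoice(user_id, audio_feature, text_feature)
    return feature_library[user_id]
```

At synthesis time, the stored audio feature can then be looked up from the feature library by the user's identifier, as described above.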
In this implementation, when the user registers a custom tone, the tone registration can be completed by reading only a few short phrases, without needing to read many sentences, so the operation is simpler and the user experience is improved.
In a second aspect, the present application provides a training method of a duration prediction model, where the method includes:
acquiring first text information and first voice information in training data;
Extracting first semantic features corresponding to the first text information and first acoustic features corresponding to the first voice information;
Aligning the first semantic feature with the first acoustic feature, and determining a first stretch duration feature corresponding to the first semantic feature;
Generating a random number k, dividing the first semantic feature into a first sub-feature and a second sub-feature according to the random number k, dividing the first stretch duration feature into a third sub-feature and a fourth sub-feature, and dividing the first acoustic feature into a fifth sub-feature and a sixth sub-feature;
Inputting the first sub-feature and the fifth sub-feature into a first duration prediction model, and adjusting the parameter values of the first duration prediction model according to the loss between its output result and the third sub-feature until the first duration prediction model converges;
Inputting the second sub-feature and the third sub-feature into a second duration prediction model, and adjusting the parameter values of the second duration prediction model according to the loss between its output result and the fourth sub-feature until the second duration prediction model converges.
In the training process of the duration prediction models, since the third sub-feature prompt_dur and the second sub-feature target_se are the inputs when training the second duration prediction model, the amount of information is greatly reduced compared with inputting prompt_ac, so the interference of irrelevant information can be reduced, the convergence of the duration prediction model is improved, the accuracy of the predicted speech duration is further improved, and the effect of speech synthesis is improved.
In a third aspect, the present application provides an apparatus, which is included in an electronic device, and which has a function of implementing the electronic device behavior in the first aspect and possible implementations of the first aspect, or has a function of implementing the electronic device behavior in the second aspect. The functions may be realized by hardware, or may be realized by hardware executing corresponding software. The hardware or software includes one or more modules or units corresponding to the functions described above. Such as a receiving module or unit, a processing module or unit, etc.
In a fourth aspect, the present application provides an electronic device, including: a processor, a memory, and an interface; the processor, the memory and the interface cooperate with each other such that the electronic device performs any one of the methods of the technical solutions of the first aspect or performs the method of the technical solutions of the second aspect.
In a fifth aspect, the present application provides a server comprising one or more processors; one or more memories; the memory stores one or more programs that, when executed by the processor, cause the server to perform any one of the methods of the first aspect or perform the method of the second aspect.
In one implementation manner, the server may be a cloud server, and after the cloud server obtains the personalized voice, the cloud server may also return the personalized voice to the electronic device, where the electronic device outputs or sends the personalized voice to other electronic devices.
In a sixth aspect, the present application provides a chip comprising a processor. The processor is configured to read and execute a computer program stored in the memory to perform the method of the first aspect and any possible implementation thereof, or to perform the method of the technical solution of the second aspect.
Optionally, the chip further comprises a memory, and the memory is connected with the processor through a circuit or a wire.
Further optionally, the chip further comprises a communication interface.
In a seventh aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, which when executed by a processor causes the processor to perform any one of the methods of the first aspect or to perform the method of the second aspect.
In an eighth aspect, the present application provides a computer program product comprising: computer program code which, when run on an electronic device, causes the electronic device to perform any one of the methods of the technical solutions of the first aspect or to perform the method of the technical solutions of the second aspect.
Drawings
FIG. 1 is a diagram illustrating an example of a voice assistant function according to an embodiment of the present application;
FIG. 2 is an application scenario diagram of an AI call function provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an example of an interface for setting the voice assistant function according to an embodiment of the present application;
FIG. 5 is an interface diagram of an exemplary custom tone setting process according to an embodiment of the present application;
FIG. 6 is an interface diagram of another custom tone setting process provided by an embodiment of the present application;
FIG. 7 is an interface diagram of a custom tone setting process according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an audio model training process according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an example duration prediction model training process provided by an embodiment of the present application;
FIG. 10 is a process diagram of an exemplary personalized speech synthesis method according to an embodiment of the present application;
FIG. 11 is a process diagram of another personalized speech synthesis method provided by an embodiment of the present application;
FIG. 12 is a schematic diagram of an example of a process for predicting stretch duration according to an embodiment of the present application;
FIG. 13 is a block diagram of a software architecture of an example electronic device according to an embodiment of the present application;
FIG. 14 is a flowchart of an exemplary personalized speech synthesis method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings. In the description of the embodiments of the present application, unless otherwise indicated, "/" means "or"; for example, A/B may represent A or B. "And/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, in the description of the embodiments of the present application, "plurality" means two or more.
The terms "first," "second," "third," and the like, are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", or a third "may explicitly or implicitly include one or more such feature.
With the continuous development of electronic devices, their abundant functions bring great convenience to users' lives. Illustratively, when the electronic device has enabled a voice assistant function (e.g., an A assistant function), the user may hold a voice conversation with the electronic device: if the user inputs a sentence of speech, the electronic device may reply with speech based on it. For example, the user inputs "how is the weather today", and the electronic device replies "the lowest air temperature today is 15 degrees Celsius, and the highest air temperature is 25 degrees Celsius". Alternatively, the user may control the electronic device by voice; for example, if the user says "open the A application", the electronic device may reply "OK" by voice and enter the interface of the A application. For another example, when the electronic device has enabled the AI call function and receives an incoming call, the user may input text on the call interface, so that the electronic device sends speech synthesized from the text to the peer electronic device.
In the scenario of the voice assistant function, for example, after the user triggers waking up of the voice assistant, electronic device A may display a voice input interface as shown in diagram (a) in FIG. 1, and the user may input "how is the weather today" by voice, where the text corresponding to the speech may be displayed on electronic device A. After electronic device A recognizes the speech, the background can query today's weather information through the network and convert the text corresponding to the weather information into speech for output, for example, outputting "the lowest air temperature today is 15 degrees Celsius and the highest air temperature is 25 degrees Celsius" by voice. Alternatively, as shown in diagram (b) in FIG. 1, after electronic device A recognizes the speech, a voice dialog box may be displayed, in which the text "how is the weather today" input by the user by voice and the text corresponding to the weather information queried by electronic device A, "the lowest air temperature today is 15 degrees Celsius and the highest air temperature is 25 degrees Celsius", are displayed; at the same time, electronic device A may also output the weather information by voice.
In the scenario of the AI call function, as shown in FIG. 2, for example, when electronic device A receives an incoming call from electronic device B, electronic device A displays an incoming call interface, which may include information such as the user name, the calling number, the number attribution and the operator, and may further include an answer control 21, a reject control 22 and a first control 23. Optionally, the first control 23 may be a display form for user interaction on the electronic device, such as a touch pop-up window, a card, a control or a floating ball. Optionally, the first control 23 may include a text prompt option area for turning on the "call caption" function. If the user answers the incoming call on electronic device A and clicks the first control 23, electronic device A may update and display the first control 23 as the first window 200, that is, the user can open the call caption directly while answering the call. At this time, electronic device A may convert the voice information 1 sent by electronic device B into corresponding text information 1 and display the text information 1 in the first window 200; for example, the text information 1 shown in FIG. 2 is "Hello, I am financial advisor Xiao Li". If the user inputs the reply information "May I ask what this is about" on electronic device A and clicks the send control 24, electronic device A may also display the text information 2 "May I ask what this is about" input by the user in the first window 200, and convert that text information into voice information 2 "May I ask what this is about" to be sent to electronic device B.
As can be seen from the above scenario descriptions, whether electronic device A performs a voice reply of the voice assistant or a voice call of the AI call, the text-to-speech (TTS) synthesis process is involved, and the aim of speech synthesis is to enable the electronic device to generate human voices with different timbres. The difference is that in the voice assistant function the electronic device converts the retrieved text information or the text information of a fixed reply into speech for output, while in the AI call function the electronic device converts the text information input by the user into speech and sends it to the peer electronic device; in essence, however, both require the electronic device to convert text information into speech information.
In recent years, in order to improve the user experience, the electronic device has further been provided with a personalized speech synthesis function, so that the user can record speech himself or herself and the electronic device can subsequently output speech in the user's own timbre. In the scenario of the voice assistant function, the user may customize a timbre by reading a template text to record speech and generate a personalized timbre belonging to the user. In the AI call scenario, the user can also customize a timbre to generate a personalized timbre belonging to the user, so that when the user subsequently inputs text information to be replied on electronic device A, the electronic device can convert the text information into voice information with the personalized timbre, which is equivalent to replying to the peer user in the user's own voice, giving a better user experience.
In the related art, the electronic device generally needs to construct an audio model, process, through the audio model, the audio feature extracted when the user customized the timbre together with the text information to be output as speech, and output voice information corresponding to the text information with the user's timbre. However, in this process, since the voice information corresponding to each piece of text has a speech duration, if the speech duration is not predicted effectively, the prosody and other characteristics of the output voice information may deviate from the user's actual characteristics, that is, the synthesis effect of the personalized speech is poor.
In view of this, the embodiments of the present application provide a personalized speech synthesis method: when predicting the speech duration of text information, a two-stage duration prediction model is adopted to improve the accuracy of the predicted speech duration, so that the prosody and other characteristics of the personalized speech synthesized by the audio model are closer to the user's actual characteristics and the speech synthesis effect is better. It should be noted that the personalized speech synthesis method provided by the present application can be applied to mobile phones, tablet computers, wearable devices, vehicle-mounted devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, personal digital assistants (PDA), smart home devices and other electronic devices capable of performing personalized speech settings.
Fig. 3 is a schematic structural diagram of an electronic device 100 according to an embodiment of the present application. Taking the example of the electronic device 100 being a mobile phone, the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, a subscriber identity module (subscriber identification module, SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. The different processing units may be separate devices or may be integrated in one or more processors.
The controller may be a neural hub and a command center of the electronic device 100, among others. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it may be called directly from memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. The structures of the antennas 1 and 2 in fig. 3 are only one example. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc., applied to the electronic device 100. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA), etc. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 150 can amplify the signal modulated by the modem processor, and convert the signal into electromagnetic waves through the antenna 1 to radiate. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the same device as at least some of the modules of the processor 110.
The wireless communication module 160 may provide solutions for wireless communication applied to the electronic device 100, including wireless local area network (WLAN) (e.g., wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), etc. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, modulates and filters the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency-modulate and amplify it, and convert it into electromagnetic waves for radiation via the antenna 2.
In some embodiments, the antenna 1 and the mobile communication module 150 of the electronic device 100 are coupled, and the antenna 2 and the wireless communication module 160 are coupled, so that the electronic device 100 can communicate with networks and other devices through wireless communication technologies. The wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, among others. The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS) and/or the satellite based augmentation systems (SBAS).
The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The electronic device 100 may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
The NPU is a neural-network (NN) computing processor, and can rapidly process input information by referencing a biological neural network structure, for example, referencing a transmission mode between human brain neurons, and can also continuously perform self-learning. Applications such as intelligent awareness of the electronic device 100 may be implemented through the NPU, for example: image recognition, face recognition, speech recognition, text understanding, etc.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device 100. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card.
The internal memory 121 may be used to store computer-executable program code that includes instructions. The processor 110 executes various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an application program (such as a sound playing function, an image playing function, etc.) required for at least one function of the operating system, etc. The storage data area may store data created during use of the electronic device 100 (e.g., audio data, phonebook, etc.), and so on. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like.
The electronic device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playing, recording, etc.
It should be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
It will also be appreciated that the above illustration of fig. 3 is merely exemplary of when the electronic device is a cell phone. If the electronic device is a tablet computer, a PC, a PDA, a wearable device, or other types of devices, the electronic device may include fewer structures than those shown in fig. 3, or may include more structures than those shown in fig. 3, which is not limited herein.
As can be seen from the above description, if the user wants to use the personalized speech synthesis function, a custom tone is required (i.e., registration of the speech synthesis function). The embodiment of the present application provides a setting process for the personalized speech synthesis function, described below by taking setting in the scenario of the voice assistant function as an example. Illustratively, as shown in FIG. 4, the user clicks a settings icon on the desktop interface of the electronic device to make the electronic device enter the settings interface. The settings interface includes setting options for different functions, such as a WLAN option, a Bluetooth option, a display and brightness option, and a smart assistant option; if the user clicks the smart assistant option control 41, the electronic device may jump to the smart assistant setting interface. The smart assistant setting interface includes various intelligent functions such as assistant suggestions, negative one-screen, intelligent text, intelligent search and intelligent voice; if the user clicks the intelligent voice option control 42, the electronic device may jump to the intelligent voice setting interface. The intelligent voice setting interface includes different wake-up modes of the voice assistant, such as voice wake-up, power key wake-up, breath wake-up, earphone wire-control wake-up and Bluetooth device wake-up, and the user can select a corresponding wake-up mode to wake up the voice assistant as required. It can be understood that after the user selects a wake-up mode in the intelligent voice setting interface, the voice assistant function is enabled, and then personalized speech can be synthesized using the personalized speech synthesis method of the present application, so as to output voice information for conversing with the user, and so on.
Meanwhile, the broadcast tone can be set in the intelligent voice setting interface. If the user clicks the broadcast tone option control 43, the electronic device may jump to the interface shown in FIG. 5. On this interface, the electronic device provides different types of tones for the user to select, e.g., an assistant (male voice), an assistant (female voice), etc. In addition, the electronic device further provides a custom tone function; if the user clicks the custom tone control 51, the electronic device can jump to the custom tone setting interface. The custom tone setting interface displays recorded tones, such as tone 1 and tone 2, and includes a new tone control 52; if the user clicks the new tone control 52, the electronic device may jump to the new tone interface. The new tone interface presents guidance for creating a personal voice; if the user wants to record his or her own personalized voice at this time, the next-step control 53 can be clicked to trigger voice recording. When the user clicks the next-step control 53, the electronic device may present a prompt to read a phrase aloud; after clicking the start control 54, the user may read the corresponding phrase (also referred to as the registration text), and the electronic device records the user's voice. Optionally, when the user registers a custom tone, the embodiment of the present application only requires the user to read N short phrases (N ≤ 2).
In some cases, if the user has interference or abnormal reading during the recording process, and the recording is not successful, the electronic device may display a prompt interface as shown in fig. 6, display a prompt message of "recording fails, please re-record", and then the user may re-click the start control 54 and read the corresponding phrase.
In other cases, if the user has recorded successfully, the electronic device may display an interface as shown in FIG. 7, on which a "play recording" control 71 is displayed, and the user may click the "play recording" control 71 to listen to the content just recorded. If the user is not satisfied with the recorded content, the user may click the previous-step control 72, and the electronic device re-enters the interface shown in FIG. 5 for recording. If the user is satisfied with the recorded content, the next-step control 73 may be clicked, and the electronic device enters a synthesized-tone interface, on which the electronic device may present a default synthesis phrase and the corresponding synthesized tone. For example, the default synthesis phrase is "set an alarm clock for 8 a.m. tomorrow for you", and the corresponding synthesized tone can be auditioned through the "listening test" control 74. It can be understood that the synthesized tone is obtained by the electronic device analyzing the sound recorded by the user, extracting the user's timbre characteristics, and then using those timbre characteristics to infer the synthesis phrase, that is, the synthesis phrase is read aloud by simulating the user's voice. At this time, if the user is satisfied with the effect of the synthesized tone, the user can click the completion control 75, which completes the custom tone process; subsequently, the user can experience the personalized speech synthesis function when using the voice assistant function, and the electronic device can convert the text information replying to the user into personalized speech for output.
According to the scene description, when the user registers the custom tone, the user can finish the process by only reading few phrases, and reading multiple sentences is not needed, so that the operation is simple, and the user experience is improved.
It can be understood that, in the context of the AI call function, the user may also use a manner similar to the foregoing to customize tone color, so that the electronic device may convert the text information input by the user into personalized voice and send the personalized voice to the other electronic device, that is, simulate the tone color of the user and send the call content to the other electronic device. It can be further understood that, besides the above-mentioned scenes of the voice assistant function and the AI call function, in some other scenes, for example, a short video dubbing scene, a personalized voice synthesis function may also be used, and after the user defines the tone, the electronic device may output the speech of the short video in the voice of the user, so as to achieve the effect of in-person dubbing.
In the above scenario, after the user clicks the completion control 75 to complete the custom tone process, the electronic device may perform audio feature extraction on the sound recorded by the user to obtain the audio feature corresponding to the user. If the user subsequently inputs text information, or the electronic device itself obtains text information, the electronic device may generate personalized speech by performing inference on the text information based on the constructed audio model in combination with the audio feature corresponding to the user. The process of the personalized speech synthesis method based on the text-independent audio model according to the embodiment of the present application is described in detail below.
First, the audio model of the embodiment of the present application needs to be trained to obtain a converged network model. In some embodiments, the training process of the audio model may be performed by a cloud server, and the audio model may further include an encoder and a decoder. Before training, the cloud server may obtain multiple sets of training data, where each set of training data may include a piece of text information and a piece of speech information, and the speech information in different sets of training data may be recorded by different users. For example, the text information in a certain set of training data is "the highest temperature is 25 degrees Celsius today", and the speech information is the recording of a user reading "the highest temperature is 25 degrees Celsius today". For each set of training data, the cloud server may perform the following processing. Taking training data A as an example, the cloud server performs feature extraction on the text information and the speech information in training data A respectively to obtain a semantic feature (semantic) and an acoustic feature (acoustic), then constructs a text-independent prompt based on the semantic feature and the acoustic feature, and further decodes the constructed prompt to obtain an output result. In other embodiments, the training process of the audio model may also be performed by the electronic device if the electronic device has model training capabilities.
As shown in fig. 8, the cloud server performs feature extraction on the text information, and the process of obtaining the semantic features may include a front-end process, a dimension mapping process (embedding), an encoding process (encoder), and a duration stretching process (LR), where the front-end process may include a text regularization process, a prosody prediction process, and a phonetic notation process. In some embodiments, the process of feature extraction of textual information to obtain semantic features may be performed by a semantic encoder.
In daily life, some text appears in abbreviated or shorthand form, so the cloud server needs to standardize it, for example converting telephone numbers, times, amounts of money, units, symbols, e-mail addresses, dates and the like into standardized text, that is, performing text regularization; for example, "Sep. 11th" needs to be expanded into the fully spelled "September Eleventh". Optionally, the cloud server may perform text regularization on the text information by using regular expressions. Then, because some words need to be paused on or stressed when read aloud, and inaccurate pauses cause discontinuous and unnatural reading and may even affect the meaning expressed by the corresponding voice information, the cloud server also needs to perform prosody prediction processing on the text information; for example, after prosody prediction is performed on "the lowest air temperature today is 15 degrees Celsius", the pause positions within the sentence are obtained. Optionally, the cloud server may use a deep network for prosody prediction, where the deep network used is trained with prosody-annotated text data. Then, because some characters are polyphonic and are pronounced differently in different words, the cloud server needs to perform phonetic annotation on the text information to accurately determine the pronunciation of each character. Optionally, the cloud server may also use a deep network to convert the text information into pinyin to solve the polyphone problem, where that deep network is trained with polyphone data. Optionally, after the phonetic annotation processing, the cloud server may obtain the phoneme information (phone) corresponding to the text information.
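A toy sketch of this front-end processing chain (text regularization, prosody prediction, phonetic annotation); the regular-expression rules and the placeholder prediction models are illustrative assumptions only.

```python
import re

def text_regularize(text: str) -> str:
    """Expand a few abbreviated forms into fully spelled-out text (illustrative rules only)."""
    months = {"sep.": "September", "jan.": "January"}
    for abbr, full in months.items():
        text = re.sub(re.escape(abbr), full, text, flags=re.IGNORECASE)
    text = re.sub(r"(\d+)\s*℃", r"\1 degrees Celsius", text)
    return text

def frontend(text: str, prosody_model, g2p_model):
    normalized = text_regularize(text)
    pauses = prosody_model(normalized)  # pause/stress positions from a trained deep network
    phonemes = g2p_model(normalized)    # phonetic annotation resolving polyphonic characters
    return phonemes, pauses
```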
After performing front-end processing on the text information, the cloud server may perform dimension mapping processing (embedding), encoding processing (encoder) and duration stretching processing on the text information. Here, embedding refers to the process of mapping high-dimensional data (e.g., text, pictures, audio) into a low-dimensional space; the encoder integrates the features of the input text information and converts them into a high-level abstract feature block; the duration stretching process stretches the text information to its reading duration, that is, the reading duration of each character (or phoneme) is predicted and added to the text information to obtain the total duration-frame information corresponding to the text information. For example, if the duration of the user reading one character is n seconds, the total duration-frame information corresponding to a sentence containing 10 characters is 10×n seconds. Optionally, when the cloud server performs the duration stretching processing on the text information, the corresponding stretch duration may be determined based on the voice information in training data A: because the text information and the voice information in training data A correspond to each other, after alignment processing the voice duration corresponding to the voice information may be used as the duration by which the text information is stretched.
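The duration stretching (length-regulator) step described above can be sketched as follows in PyTorch; the tensor shapes and the frame count per phoneme are assumptions for illustration only.

```python
import torch

# Sketch of duration stretching: each phoneme embedding is repeated for its
# number of frames so the semantic feature reaches the total duration-frame length.
def length_regulate(phoneme_emb: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """phoneme_emb: [num_phonemes, dim]; durations: [num_phonemes] frame counts."""
    return torch.repeat_interleave(phoneme_emb, durations, dim=0)  # [total_frames, dim]

emb = torch.randn(10, 256)                      # 10 phonemes, 256-dim encoder output (assumed)
dur = torch.full((10,), 20, dtype=torch.long)   # e.g. 20 frames per phoneme (assumed)
stretched = length_regulate(emb, dur)
print(stretched.shape)                          # torch.Size([200, 256])
```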
Through the processing, the cloud server can obtain semantic features (semantic) corresponding to the text information.
Meanwhile, the cloud server also performs feature extraction on the voice information to obtain acoustic features (acoustic), that is, the cloud server can extract the semantic features and extract the acoustic features in parallel. With continued reference to fig. 8, the cloud server performs feature extraction on the voice information, and the process of obtaining the acoustic features may include an audio encoding process, a multi-channel mapping process (embedding), and a multi-channel accumulation process.
The cloud server may perform audio encoding processing on the voice information by using an audio encoder to obtain an acoustic vector corresponding to the voice information, where the acoustic vector may be a multi-channel, multi-frame feature vector, for example an 8-channel feature vector, or a feature vector with another number of channels, which is not limited in the embodiment of the present application. Then, the cloud server may mask (mask) the feature vector of each channel in the acoustic vector and then perform embedding to obtain the embedding of multiple channels, for example the embedding of 8 channels. Alternatively, the embedding of each channel feature vector may take the form acoustic_emb_i = embedding(mask(channel_i)), where masking hides some frames in the multi-frame feature vector so that the subsequent decoder must infer them, which trains the accuracy of the inference. Then, the cloud server may perform weighted accumulation on the embeddings of the multiple channels to obtain a feature vector of a single channel, i.e. obtain the acoustic feature (acoustic). Optionally, the cloud server may calculate the acoustic feature as acoustic = Σ_{i=1}^{N} w_i × acoustic_emb_i, where w_i is the weight coefficient corresponding to channel i, and N is the number of channels, for example N = 8.
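A possible reading of the multi-channel masking, embedding and weighted accumulation described above is sketched below. The codebook size, channel count, masking ratio and the use of a shared embedding table with a reserved mask index are assumptions, not the patent's actual implementation.

```python
import torch
import torch.nn as nn

# Sketch: N-channel codec-style output -> per-channel masked embedding -> weighted sum.
N_CHANNELS, N_FRAMES, CODEBOOK, DIM = 8, 120, 1024, 256

codes = torch.randint(0, CODEBOOK, (N_CHANNELS, N_FRAMES))     # stand-in audio encoder output
embed = nn.Embedding(CODEBOOK + 1, DIM, padding_idx=CODEBOOK)  # extra index used as mask token
weights = nn.Parameter(torch.ones(N_CHANNELS) / N_CHANNELS)    # w_i, learnable

mask = torch.rand(N_CHANNELS, N_FRAMES) < 0.15                 # hide some frames (ratio assumed)
masked_codes = codes.masked_fill(mask, CODEBOOK)               # replace masked frames with mask token

channel_emb = embed(masked_codes)                              # [N, frames, dim] = acoustic_emb_i
acoustic = (weights.view(-1, 1, 1) * channel_emb).sum(dim=0)   # acoustic = sum_i w_i * acoustic_emb_i
print(acoustic.shape)                                          # torch.Size([120, 256])
```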
Through the processing, the cloud server can obtain acoustic features (acoustic) corresponding to the voice information. It can be understood that, when the cloud server stretches the duration of the text information, the length of the finally obtained semantic feature is equal to the length of the acoustic feature, which can be determined based on the voice duration corresponding to the voice information.
After obtaining the semantic features and the acoustic features, the cloud server may begin constructing the text-independent prompt (i.e., the core of the audio model, which needs to learn the audio features carried in the prompt). With continued reference to fig. 8, the process may include: a random number m is first generated, and the semantic feature and the acoustic feature are each divided into two parts according to the random number m, where m is smaller than the length of the semantic feature (or of the acoustic feature); preferably, m is constrained to lie between fixed fractions of len(semantic), where len(semantic) is the length of the semantic feature. The cloud server then divides the semantic feature into two parts, prompt_se and target_se, according to the random number m, and divides the acoustic feature into two parts, prompt_ac and target_ac. The cloud server then fuses the divided prompt_se and target_se, prompt_ac and target_ac through the audio model to obtain a target prompt and a target. Optionally, target prompt = prompt_ac, and target = a×target_se + b×target_ac, where a and b are weight coefficients. It can be seen that the target prompt is independent of the semantic features, that is, independent of the text, so the target prompt may be used as the prompt information for the subsequent decoder, enabling the decoder to add a timbre to the text information according to the audio features carried in the target prompt, that is, the timbre of the output voice is as close as possible to the timbre corresponding to the input voice information, and the target may be used as the learning content of the audio model.
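The construction of the text-independent prompt can be sketched as follows. The split-range fractions, the weight values a and b, and the tensor shapes are illustrative assumptions only.

```python
import random
import torch

# Sketch: random split point m, prompt taken from the acoustic side only,
# learning target mixes both sides with weight coefficients a and b (values assumed).
def build_prompt_and_target(semantic: torch.Tensor, acoustic: torch.Tensor,
                            a: float = 0.5, b: float = 0.5):
    assert semantic.shape[0] == acoustic.shape[0]   # aligned by duration stretching
    length = semantic.shape[0]
    m = random.randint(length // 3, length // 2)    # split range is an assumed convention
    prompt_se, target_se = semantic[:m], semantic[m:]   # prompt_se is split out but unused:
    prompt_ac, target_ac = acoustic[:m], acoustic[m:]   # the prompt is text-independent
    target_prompt = prompt_ac
    target = a * target_se + b * target_ac
    return target_prompt, target

tp, tg = build_prompt_and_target(torch.randn(200, 256), torch.randn(200, 256))
print(tp.shape, tg.shape)
```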
After the target prompt and the target are obtained, the cloud server can input them into the decoder for decoding to obtain an output voice result; the features of the output result can then be compared with the acoustic features of the input voice information to calculate a loss function, so that the parameter values of the audio model are adjusted according to the loss function until a converged audio model is finally obtained. Alternatively, the decoder input may take the form decoder_in = cat(target prompt, target).
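A minimal sketch of this decoder step and loss computation follows, assuming a stand-in Transformer module in place of the patent's actual decoder and an L1 loss in place of its unspecified loss function.

```python
import torch
import torch.nn as nn

# Stand-in decoder; the real audio model's decoder architecture is not specified here.
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=2)

def training_step(target_prompt, target, reference_acoustic):
    # decoder_in = cat(target_prompt, target) along the time axis.
    decoder_in = torch.cat([target_prompt, target], dim=0).unsqueeze(0)  # [1, T, 256]
    out = decoder(decoder_in).squeeze(0)
    # Compare only the target portion against the reference acoustic features.
    loss = nn.functional.l1_loss(out[target_prompt.shape[0]:], reference_acoustic)
    return loss

loss = training_step(torch.randn(80, 256), torch.randn(120, 256), torch.randn(120, 256))
loss.backward()   # gradients would drive the parameter update of the audio model
print(float(loss))
```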
In the training process of the audio model, when the cloud server performs the duration stretching processing on the text information, the stretch duration can be determined because the cloud server has the voice information corresponding to the text information. However, when the audio model is used to generate personalized voice for text information acquired by the electronic device, the acquired text information has no corresponding voice information from which the stretch duration could be determined, so the stretch duration corresponding to the text information needs to be predicted. Therefore, during the training of the audio model, the embodiment of the application may also train a duration prediction model for subsequently predicting the stretch duration of the text information.
In some embodiments, the cloud server may acquire the encoder result corresponding to the text information in the training data, input it together with the segmented prompt_ac of the acoustic feature into a network model, and train through a cross-attention mechanism (cross attention) to obtain the duration prediction model. However, prompt_ac carries a large amount of acoustic information, for example the user's semantic information, timbre information, speaking-rate information and even environmental noise. If prompt_ac is used as the input data for training, the duration prediction model may be disturbed by this extra information during learning, and the predicted stretch durations tend toward an average, so the convergence of the duration prediction model is poor and the personalized voice generated later has a poor effect.
In other embodiments, to improve the training accuracy of the duration prediction model, this embodiment performs information dimension reduction on the input data during training and adopts a two-stage duration prediction scheme, so as to obtain a duration prediction model with a better convergence effect.
Specifically, in the training process of the audio model, when the semantic feature (semantic) or the encoder result corresponding to the text information is obtained, the semantic feature is aligned with the acoustic feature, that is, the stretch duration (duration) corresponding to the semantic feature can also be obtained, so the cloud server can obtain the semantic feature (semantic), the stretch duration (duration), and the acoustic feature (acoustic) corresponding to the voice information. Then, as shown in fig. 9, the cloud server may divide the semantic feature, the stretch duration, and the acoustic feature: a random number k may be generated to divide them into several parts, where k is smaller than the length of the semantic feature (or of the acoustic feature). When dividing, it should be considered that there may be a period of blank speech at the very beginning of a user's utterance; if, as in the training of the audio model, the semantic feature and the acoustic feature were simply divided into front and rear parts, the front part might contain some blank values, and if the front part were marked as the prompt and subsequently used as prompt information, these blank values would affect the prediction accuracy. Therefore, in this embodiment, the cloud server may intercept a segment from the middle part of the semantic feature, the stretch duration, and the acoustic feature according to the random number k, and then combine the remaining parts to obtain the prompt and target of each feature.
For example, as shown in fig. 9, for the semantic feature (semantic), a feature of length k is taken from the first position as prompt_se, and the partial features before and after prompt_se are combined as target_se. For the stretch duration (duration) feature, a feature of length k is taken from the first position as prompt_dur, and the partial features before and after prompt_dur are combined as target_dur. For the acoustic feature (acoustic), a feature of length k is taken from the first position as prompt_ac, and the partial features before and after prompt_ac are combined as target_ac. It can be understood that the first position may be a preset intermediate position, or may be a position determined according to another random number t, which is not limited in the embodiment of the present application. It can be further understood that the split prompt_se, prompt_dur and prompt_ac are features corresponding to the same segment of positions in the text information, the stretch duration and the voice information.
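The middle-segment split described above can be sketched as follows; the concrete first position and feature shapes are assumptions for illustration.

```python
import torch

# Sketch: take a k-length slice starting at the "first position" as the prompt,
# and concatenate the remaining head and tail as the target.
def middle_split(feature: torch.Tensor, k: int, first_pos: int):
    prompt = feature[first_pos:first_pos + k]
    target = torch.cat([feature[:first_pos], feature[first_pos + k:]], dim=0)
    return prompt, target

semantic = torch.randn(300, 256)
k = 100
first_pos = semantic.shape[0] // 3          # e.g. one third into the sequence (assumed)
prompt_se, target_se = middle_split(semantic, k, first_pos)
print(prompt_se.shape, target_se.shape)     # [100, 256] and [200, 256]
```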
Alternatively, the value of the random number k may be constrained to lie between fixed fractions of len(semantic), where len(semantic) is the length of the semantic feature (or the length of the acoustic feature). The first position may lie between one third and one half of the way through the semantic feature (or the acoustic feature and the stretch duration feature).
After prompt_se and target_se, prompt_dur and target_dur, prompt_ac and target_ac are obtained, they may be used as input data to train the two-stage duration prediction model. In the embodiment of the application, the two stages may be recorded as a first duration prediction model and a second duration prediction model. The first duration prediction model is trained first: the cloud server may input prompt_se and prompt_ac into the first duration prediction model, and predict, through the audio feature prompt_ac of the voice information, the first duration prompt_dur_predict with which the user would utter the phonemes corresponding to prompt_se. Since the actual stretch duration corresponding to prompt_se is known to be prompt_dur, a first loss (prompt_duration_loss) between prompt_dur_predict and prompt_dur may be calculated, and the model parameters of the first duration prediction model may be adjusted based on the first loss until training converges. The second duration prediction model is then trained: the cloud server may input prompt_dur and target_se into the second duration prediction model, and, using prompt_dur as prompt information, predict the second duration tg_dur_predict with which the user would utter the phonemes corresponding to target_se. Since the actual stretch duration target_dur corresponding to target_se is known, a second loss (duration_loss) between tg_dur_predict and target_dur may be calculated, and the model parameters of the second duration prediction model may be adjusted based on the second loss until training converges.
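A sketch of the two-stage training described above is given below. The stand-in models are simple linear projections with mean pooling in place of the cross-attention conditioning, and L1 losses stand in for the unspecified loss functions; none of this is the patent's actual architecture.

```python
import torch
import torch.nn as nn

class StageOne(nn.Module):                      # first duration prediction model (stand-in)
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(2 * dim, 1)
    def forward(self, prompt_se, prompt_ac):
        # Pool the acoustic prompt into one conditioning vector (simplified conditioning).
        cond = prompt_ac.mean(dim=0, keepdim=True).expand(prompt_se.shape[0], -1)
        return self.proj(torch.cat([prompt_se, cond], dim=-1)).squeeze(-1)

class StageTwo(nn.Module):                      # second duration prediction model (stand-in)
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim + 1, 1)
    def forward(self, prompt_dur, target_se):
        # Use the mean prompt duration as a simple prompt signal.
        cond = prompt_dur.mean().view(1, 1).expand(target_se.shape[0], 1)
        return self.proj(torch.cat([target_se, cond], dim=-1)).squeeze(-1)

stage1, stage2 = StageOne(), StageTwo()
prompt_se, prompt_ac = torch.randn(100, 256), torch.randn(150, 256)
target_se = torch.randn(200, 256)
prompt_dur, target_dur = torch.rand(100) * 10, torch.rand(200) * 10  # known durations

prompt_dur_predict = stage1(prompt_se, prompt_ac)
prompt_duration_loss = nn.functional.l1_loss(prompt_dur_predict, prompt_dur)   # first loss

tg_dur_predict = stage2(prompt_dur, target_se)
duration_loss = nn.functional.l1_loss(tg_dur_predict, target_dur)              # second loss
(prompt_duration_loss + duration_loss).backward()
```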
In this way, the cloud server trains the first duration prediction model and the second duration prediction model. In the process of training the second duration prediction model, since the inputs are prompt_dur and target_se, the amount of information is greatly reduced compared with prompt_ac, so the interference from irrelevant information can be reduced, the convergence of the duration prediction model is improved, and the accuracy of the predicted voice duration is further improved.
In addition, in this embodiment the cloud server intercepts the middle portion of the semantic feature, the stretch duration and the acoustic feature according to the random number k to obtain the corresponding prompts and targets. Therefore, when the target prompt and target are constructed in the training process of the audio model, the cloud server may also construct them from the prompt_se and target_se, prompt_ac and target_ac obtained in this embodiment, and then input the target prompt and target into the decoder for decoding, so as to participate in the training process of the audio model and at the same time improve the training accuracy of the audio model.
Through the above process, the cloud server constructs the audio model and the two-stage duration prediction model required by the embodiment of the application. If the user registers a custom timbre on the electronic device, the audio model can acquire the audio features of that custom timbre; for example, in the scenario of the voice assistant function, the audio model can generate personalized voice by inference combining the audio features with the queried text information, and in the scenario of the AI call function, the audio model can generate personalized voice by inference combining the audio features with the text information input by the user. In some embodiments, the process of generating personalized voice by audio model inference can still be performed at the cloud server, so the electronic device needs to send the voice information recorded when the user customizes the timbre, as well as the subsequent text information, to the cloud server. In other embodiments, the process of generating personalized voice by audio model inference can also be performed in the electronic device, in which case the cloud server needs to migrate the training-converged audio model to the electronic device. The following description takes the cloud server as an example.
As shown in fig. 10, after the user performs custom timbre registration, the electronic device may send the voice information recorded during registration to the cloud server. The cloud server can perform audio feature extraction on the voice information to obtain the audio features corresponding to the user. Optionally, the cloud server may associate the identifier of the user with the audio features and store them in a feature library, where the identifier of the user may be information such as the user name used when the user registers on the electronic device or the device identifier of the electronic device used, which is not limited in the embodiment of the present application. Then, if the cloud server receives text information (text information input by the user or text information queried by the background of the electronic device) sent by the electronic device, the cloud server can perform front-end processing such as text regularization, prosody prediction and phonetic notation on the text information, input the front-end-processed text information and the audio features corresponding to the user into the audio model for inference, and output personalized voice that is returned to the electronic device. Alternatively, the audio model may include an encoder for feature extraction of the text information, an acoustic model for converting the text information into audio acoustic features in combination with the audio features (i.e., the process of constructing the prompt), and a decoder for converting the audio acoustic features into audio for output. It can be understood that, unlike the audio model training process described above, in which the text information and the voice information in the training data correspond group by group, during inference the text information received by the cloud server changes in real time, while the audio features used are those extracted when the user performed custom timbre registration, so that the real-time text information is output as personalized voice with the user's timbre based on the user's audio features.
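The feature library mentioned above can be as simple as a keyed store from user identifier to audio feature; the following in-memory sketch is illustrative only, and the key format and storage backend are assumptions.

```python
from typing import Dict
import torch

# Sketch of a feature library mapping a user identifier to the audio feature
# extracted at custom-timbre registration (a real service would persist this).
feature_library: Dict[str, torch.Tensor] = {}

def register_user(user_id: str, registration_audio_feature: torch.Tensor) -> None:
    feature_library[user_id] = registration_audio_feature

def lookup_audio_feature(user_id: str) -> torch.Tensor:
    # Raises KeyError if the user has not registered a custom timbre.
    return feature_library[user_id]

register_user("device-001:alice", torch.randn(120, 256))   # hypothetical identifier format
print(lookup_audio_feature("device-001:alice").shape)
```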
For the embodiment shown in fig. 10, the execution procedure of the cloud server is described with reference to fig. 11: after receiving the voice information recorded during custom timbre registration, the cloud server can perform audio encoding processing, multi-channel mapping processing (embedding) and multi-channel accumulation processing on the voice information to obtain the acoustic feature (acoustic) corresponding to the voice information, use it as the audio feature corresponding to the user, and store the audio feature and the user identifier in the feature library in association. When the cloud server receives text information, front-end processing can be performed on the text information to obtain the phoneme information (phoneme) corresponding to the text information. After the front-end processing, the cloud server may perform dimension mapping processing (embedding), encoding processing (encoder) and duration stretching processing on the text information (phoneme information) to obtain the semantic feature (semantic) corresponding to the text information. Next, the cloud server may combine the audio feature (acoustic) and the semantic feature (semantic) through the audio model, input the audio feature as the prompt to the decoder for processing, and output the personalized voice. It can be understood that the cloud server can return the obtained personalized voice to the electronic device, and the electronic device outputs the personalized voice to the user through the voice assistant function or transmits it to the peer electronic device through the AI call function.
When the cloud server performs the duration stretching processing on the text information, because there is no voice information corresponding to the text information from which the stretch duration could be determined, the cloud server predicts the stretch duration based on the constructed two-stage duration prediction model. As shown in fig. 12, while the cloud server processes the voice information recorded during custom timbre registration to obtain the corresponding acoustic feature (acoustic), it may also process the registered text entered during registration; for example, front-end processing including text regularization, prosody prediction and phonetic notation is performed on the registered text to obtain the phoneme information (phoneme) corresponding to the registered text, and this phoneme information may be used as the text feature corresponding to the registered text. The acoustic feature (acoustic) and the phoneme information (phoneme) corresponding to the registered text are then input into the first duration prediction model, which outputs a first predicted duration. The result obtained after the cloud server encodes (encoder) the received text information is taken as the semantic feature, the first predicted duration and the semantic feature are input into the second duration prediction model with the first predicted duration serving as prompt information, and a second predicted duration is output, where the second predicted duration is the stretch duration used when performing duration stretching on the text information.
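The inference-time use of the two-stage duration prediction can be sketched as follows; the stand-in predictors, tensor shapes and the clamping/rounding of the output are illustrative assumptions.

```python
import torch

@torch.no_grad()
def predict_stretch_duration(stage1, stage2, reg_acoustic, reg_text_feature, text_semantic):
    # Stage 1: registration audio feature + registered-text phoneme feature -> first predicted duration.
    first_pred = stage1(reg_text_feature, reg_acoustic)
    # Stage 2: first predicted duration as prompt + encoder result of new text -> stretch duration.
    second_pred = stage2(first_pred, text_semantic)
    return second_pred.clamp(min=1).round()          # at least one frame per phoneme (assumed)

# Stand-in predictors for demonstration only; the real ones would be the trained
# first and second duration prediction models.
stage1 = lambda text_feat, acoustic: torch.full((text_feat.shape[0],), 12.0)
stage2 = lambda first_pred, semantic: torch.full((semantic.shape[0],), 15.0)

durations = predict_stretch_duration(
    stage1, stage2,
    reg_acoustic=torch.randn(300, 256),       # acoustic feature from registration audio
    reg_text_feature=torch.randn(40, 256),    # phoneme feature of the registered text
    text_semantic=torch.randn(60, 256),       # encoder result of the received text
)
print(durations.shape)                        # torch.Size([60])
```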
It can be seen from the above processing of the audio model that a text-independent audio model is constructed during training, with the prompt serving as the prompt information, so that any subsequently input text information can be inferred into voice carrying the timbre in the prompt. Therefore, when the model is used in an actual scenario, the user only needs to record a short sentence at registration: the extracted audio feature of the user is input into the audio model as the prompt, personalized voice can be output, and no subsequent fine-tuning process is needed, so the data processing efficiency is higher. Meanwhile, when the text information is converted into personalized voice, the duration prediction stage is not disturbed by the extra information in the acoustic features, and the duration prediction uses the first-stage predicted duration as prompt information, so the accuracy of the predicted voice duration is higher, the prosody and other characteristics of the personalized voice synthesized by the audio model are closer to the actual characteristics of the user, and the voice synthesis effect is better.
In other embodiments, the process of generating personalized speech by the audio model reasoning may be performed by the electronic device, where the technical principle of the process performed by the electronic device is similar to the technical principle when the process is performed by the cloud server, and the specific process is not repeated.
It is understood that the software system of the electronic device may employ a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. If the electronic device employs audio model reasoning to generate personalized speech, it should be executed in conjunction with the software system architecture of the electronic device. In the embodiment of the application, an Android system with a layered architecture is taken as an example, and the software structure of the electronic equipment is illustrated.
Fig. 13 is a software structure block diagram of an electronic device according to an embodiment of the present application. The layered architecture divides the software into several layers, each with distinct roles and responsibilities. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, the Android runtime and system libraries, and a kernel layer. The application layer may include a series of application packages, such as Android application packages (Android application package, APK).
As shown in fig. 13, the application package may include applications such as a camera, a gallery, a calendar, a call, a map, a navigation, a WLAN, music, a video, a short message, and the like, and may further include an APK integrated with an audio model, where the APK integrated with an audio model may implement a function of user-defined tone registration and inference generation of personalized speech according to audio features and input text information of a user.
The application framework layer provides an application programming interface (application programming interface, API) and programming framework for the application of the application layer. The application framework layer includes a number of predefined functions.
As shown in fig. 13, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and the like.
The content provider is used to store and retrieve data and make such data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebooks, etc.
The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including a text message notification icon may include a view displaying text and a view displaying a picture.
The telephony manager is used to provide the communication functions of the electronic device 100. Such as the management of call status (including on, hung-up, etc.).
The Android runtime (Android runtime) includes core libraries and a virtual machine. The Android runtime is responsible for scheduling and management of the Android system.
The core library consists of two parts: one part is the functional interfaces that the Java language needs to call, and the other part is the core library of Android.
The system library may include a plurality of functional modules. For example: surface manager (surface manager), media library (media library), three-dimensional graphics processing library (e.g., openGL ES), 2D graphics engine (e.g., SGL), etc.
The kernel layer is a layer between hardware and software. The inner core layer at least comprises a display driver, a camera driver, an audio driver and a sensor driver.
In the case where the electronic device performs the above personalized speech synthesis method, fig. 14 is a flowchart of an exemplary personalized speech synthesis method provided in an embodiment of the present application, and as shown in fig. 14, the method may include:
S101, under the condition that the electronic equipment starts a voice synthesis function, text information which needs to be subjected to voice synthesis in the electronic equipment is obtained.
The text information comprises text information input by the user and text information generated by the electronic device. The text information input by the user may be the content to be replied to the peer user in the scenario of the AI call function, and the text information generated by the electronic device may be the reply content queried by the electronic device in the scenario of the voice assistant function.
S102, acquiring the audio features and the text features corresponding to the user.
The audio features are generated according to voice information input when the user registers the voice synthesis function, and the text features are generated according to registered text information read by the user when the user registers the voice synthesis function.
S103, determining the stretching time length when the text information is synthesized by voice based on the audio frequency characteristic, the text characteristic and the coding result corresponding to the text information.
The process of predicting the stretching time period in S103 may be referred to the description of fig. 12, and will not be described herein.
S104, based on the stretching time length, performing voice synthesis on the text information and the audio features corresponding to the user through the audio model, and outputting personalized voice.
The audio model is a model which is obtained by training according to the first text information and the first voice information in the training data and is irrelevant to the text. The processing procedure of the audio model may be described in detail in the above embodiments, and will not be described herein.
Examples of the personalized speech synthesis method provided by the embodiment of the application are described in detail above. It will be appreciated that the electronic device, in order to achieve the above-described functions, includes corresponding hardware and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application in conjunction with the embodiments, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application can divide the functional modules of the electronic device according to the method example, for example, each function can be divided into each functional module, for example, a detection unit, a processing unit, a display unit, and the like, and two or more functions can be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present application, the division of the modules is schematic, which is merely a logic function division, and other division manners may be implemented in actual implementation.
It should be noted that, all relevant contents of each step related to the above method embodiment may be cited to the functional description of the corresponding functional module, which is not described herein.
The electronic device provided in this embodiment is configured to execute the personalized speech synthesis method, so that the same effects as those of the implementation method can be achieved.
In case an integrated unit is employed, the electronic device may further comprise a processing module, a storage module and a communication module. The processing module can be used for controlling and managing the actions of the electronic equipment. The memory module may be used to support the electronic device to execute stored program code, data, etc. And the communication module can be used for supporting the communication between the electronic device and other devices.
Wherein the processing module may be a processor or a controller, which may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor may also be a combination implementing computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor (DSP) and a microprocessor, and the like. The memory module may be a memory. The communication module may be a radio frequency circuit, a Bluetooth chip, a Wi-Fi chip or other devices that interact with other electronic devices.
In one embodiment, when the processing module is a processor and the storage module is a memory, the electronic device according to this embodiment may be a device having the structure shown in fig. 3.
The embodiment of the application also provides a computer readable storage medium, in which a computer program is stored, which when executed by a processor, causes the processor to execute the personalized speech synthesis method of any of the above embodiments.
The embodiment of the application also provides a computer program product, which when run on a computer, causes the computer to execute the above related steps to implement the personalized speech synthesis method in the above embodiment.
In addition, embodiments of the present application also provide an apparatus, which may be embodied as a chip, component or module, which may include a processor and a memory coupled to each other; the memory is used for storing computer-executable instructions, and when the device is operated, the processor can execute the computer-executable instructions stored in the memory, so that the chip executes the personalized speech synthesis method in each method embodiment.
The electronic device, the computer readable storage medium, the computer program product or the chip provided in this embodiment are used to execute the corresponding method provided above, so that the beneficial effects thereof can be referred to the beneficial effects in the corresponding method provided above, and will not be described herein.
It will be appreciated by those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts shown as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application may be essentially or a part contributing to the prior art or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a device (may be a single-chip microcomputer, a chip or the like) or a processor (processor) to perform all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (17)

1. A method of personalized speech synthesis, the method comprising:
under the condition that the electronic equipment starts a voice synthesis function, acquiring text information which is required to be subjected to voice synthesis in the electronic equipment, wherein the text information comprises text information input by a user and text information generated by the electronic equipment;
Acquiring an audio feature and a text feature corresponding to the user, wherein the audio feature is generated according to voice information recorded when the user performs voice synthesis function registration, and the text feature is generated according to registered text information read by the user when the user performs voice synthesis function registration;
Determining the stretching time length when the text information is subjected to voice synthesis based on the audio feature, the text feature and the coding result corresponding to the text information;
And based on the stretching time length, performing voice synthesis on the text information and the audio features through an audio model, and outputting personalized voice, wherein the audio model is a model which is obtained by training according to the first text information and the first voice information in training data and is irrelevant to a text.
2. The method according to claim 1, wherein determining a stretch duration when speech synthesis is performed on the text information based on the audio feature, the text feature, and the encoding result corresponding to the text information includes:
Inputting the audio features and the text features into a first time length prediction model, and outputting a first prediction time length;
inputting the coding results corresponding to the first predicted time length and the text information into a second time length prediction model, and outputting the stretching time length.
3. The method of claim 2, wherein the training of the first and second time duration prediction models comprises:
Acquiring a first semantic feature corresponding to the first text information and a first acoustic feature corresponding to the first voice information;
aligning the first semantic features with the first acoustic features, and determining first stretching duration features corresponding to the first semantic features;
generating a random number k, dividing the first semantic feature into a first sub-feature and a second sub-feature according to the random number k, dividing the first stretching duration feature into a third sub-feature and a fourth sub-feature, and dividing the first acoustic feature into a fifth sub-feature and a sixth sub-feature;
inputting the first sub-feature and the fifth sub-feature into the first time length prediction model, and adjusting the parameter value of the first time length prediction model according to the loss between the output result and the third sub-feature until the first time length prediction model converges;
And inputting the second sub-feature and the third sub-feature into the second duration prediction model, and adjusting the parameter value of the second duration prediction model according to the loss between the output result and the fourth sub-feature until the second duration prediction model converges.
4. The method of claim 3, wherein the random number k has a value within a preset proportion range of L, where L is the data length of the first semantic feature.
5. The method according to claim 3 or 4, wherein the dividing the first semantic feature into a first sub-feature and a second sub-feature according to the random number k comprises:
taking k-length features from a first position of the first semantic features as the first sub-features, and combining the features of the rest as the second sub-features;
The dividing the first stretching time length feature into a third sub-feature and a fourth sub-feature includes:
Taking the k-length characteristic as the third sub-characteristic from the first position of the first stretching duration characteristic, and combining the characteristics of the rest part as the fourth sub-characteristic;
the dividing the first acoustic feature into a fifth sub-feature and a sixth sub-feature includes:
and taking the characteristic of k length from the first position of the first acoustic characteristic as the fifth sub-characteristic, and taking the characteristic combination of the rest part as the sixth sub-characteristic.
6. The method of claim 1, wherein the speech synthesis of the text information and the audio feature by an audio model based on the stretch duration, outputting personalized speech, comprises:
Extracting characteristics of the text information through the audio model to obtain semantic characteristics of the length of the stretching duration corresponding to the text information;
And performing voice synthesis on the semantic features and the audio features through the audio model, and outputting the personalized voice.
7. The method of claim 6, wherein the speech synthesis of the semantic features and the audio features by the audio model, outputting the personalized speech, comprises:
And taking the audio features as tone color prompts, fusing and decoding the semantic features and the audio features through the audio model to obtain and output the personalized voice.
8. The method according to claim 6 or 7, wherein the feature extraction of the text information by the audio model to obtain semantic features of the length of the stretching duration corresponding to the text information includes:
performing front-end processing on the text information to obtain phoneme information corresponding to the text information, wherein the front-end processing comprises at least one processing procedure of text regularization processing, prosody prediction processing and phonetic notation processing;
and performing dimension mapping processing, coding processing and long-time stretching processing on the phoneme information through the audio model to obtain semantic features with the length of the stretching duration.
9. The method of claim 1, wherein the obtaining the audio feature and the text feature corresponding to the user comprises:
Acquiring voice information input when the user performs voice synthesis function registration and registration text information read by the user when the user performs voice synthesis function registration;
performing audio coding processing, multichannel mapping processing and multichannel accumulating processing on the recorded voice information to obtain audio features corresponding to the user;
and performing front-end processing on the registered text information to obtain the text characteristics.
10. The method of claim 9, wherein the number of short sentences N corresponding to the entered speech information is less than or equal to 2.
11. The method according to claim 9 or 10, wherein after said deriving the audio features corresponding to the user, the method further comprises:
And storing the audio features corresponding to the users and the identifications of the users in a feature library in an associated mode.
12. The method of claim 1, wherein the text information is text information generated by the electronic device in a scenario in which the electronic device is in use with a voice assistant, and wherein the text information is text information entered by the user in a scenario in which the electronic device is in use with an artificial intelligence AI call.
13. A method of training a duration prediction model, the method comprising:
acquiring first text information and first voice information in training data;
Extracting a first semantic feature corresponding to the first text information and a first acoustic feature corresponding to the first voice information;
aligning the first semantic features with the first acoustic features, and determining first stretching duration features corresponding to the first semantic features;
generating a random number k, dividing the first semantic feature into a first sub-feature and a second sub-feature according to the random number k, dividing the first stretching duration feature into a third sub-feature and a fourth sub-feature, and dividing the first acoustic feature into a fifth sub-feature and a sixth sub-feature;
Inputting the first sub-feature and the fifth sub-feature into a first time length prediction model, and adjusting the parameter value of the first time length prediction model according to the loss between the output result and the third sub-feature until the first time length prediction model converges;
And inputting the second sub-feature and the third sub-feature into a second duration prediction model, and adjusting the parameter value of the second duration prediction model according to the loss between the output result and the fourth sub-feature until the second duration prediction model converges.
14. An electronic device, comprising:
one or more processors;
One or more memories;
the memory stores one or more programs that, when executed by the processor, cause the electronic device to perform the method of any of claims 1-13.
15. A server, comprising:
one or more processors;
One or more memories;
The memory stores one or more programs that, when executed by the processor, cause the server to perform the method of any of claims 1-13.
16. The server of claim 15, wherein the one or more programs, when executed by the processor, cause the server to further perform:
Personalized speech is sent to the electronic device.
17. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, causes the processor to perform the method of any of claims 1 to 13.