CN113299273A - Voice data synthesis method, terminal device, and computer-readable storage medium - Google Patents


Info

Publication number
CN113299273A
CN113299273A
Authority
CN
China
Prior art keywords
voice
target
synthesis process
cloud
voice data
Prior art date
Legal status
Granted
Application number
CN202110553610.4A
Other languages
Chinese (zh)
Other versions
CN113299273B (en)
Inventor
李智豪
陈思云
Current Assignee
Guangzhou Xiaopeng Motors Technology Co Ltd
Original Assignee
Guangzhou Xiaopeng Smart Charge Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Xiaopeng Smart Charge Technology Co Ltd
Priority to CN202110553610.4A
Publication of CN113299273A
Application granted
Publication of CN113299273B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047 - Architecture of speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the application discloses a voice data synthesis method, a terminal device, and a computer-readable storage medium. The method comprises: receiving a voice broadcast request; sending a synthesis request to a cloud according to the voice broadcast request, wherein the synthesis request instructs the cloud to run a cloud voice synthesis process and synthesize the voice data corresponding to the voice broadcast request through that process; and, if target voice data returned by the cloud is not received within a first time period, running a local voice synthesis process and synthesizing the voice data corresponding to the voice broadcast request locally. Implementing this method improves the effect of text-to-speech processing.

Description

Voice data synthesis method, terminal device, and computer-readable storage medium
Technical Field
The present application relates to the field of terminal device applications, and in particular, to a speech data synthesis method, a terminal device, and a computer-readable storage medium.
Background
Text-to-speech (TTS) is often used in human-computer interaction to provide audible feedback to a user. TTS involves the computational conversion of text into speech, and existing approaches typically perform this conversion either entirely on the local device or entirely in the cloud.
In practice, it is found that when the local device performs text-to-speech conversion, the latency of the voice data is small, but the computing power of the local device is limited: most local devices can only run simple models, so the resulting voice quality is mediocre. The cloud, by contrast, has strong computing power and can run complex models to obtain voice data with excellent sound quality, but it is greatly affected by network conditions, so the latency of the voice data is often large. The conventional text-to-speech processing effect is therefore often poor.
Disclosure of Invention
The embodiment of the application provides a voice data synthesis method, terminal equipment and a computer readable storage medium, which can improve the processing effect of converting characters into voice.
A first aspect of an embodiment of the present application provides a speech data synthesis method, including:
receiving a voice broadcast request;
sending a synthesis request to a cloud end according to the voice broadcast request, wherein the synthesis request is used for indicating the cloud end to operate a cloud end voice synthesis process, and synthesizing voice data corresponding to the voice broadcast request through the cloud end voice synthesis process;
and if the target voice data returned by the cloud end is not received in the first time period, running a local voice synthesis process, and synthesizing the voice data corresponding to the voice broadcast request through the local voice synthesis process.
As an optional implementation manner, in the first aspect of the embodiment of the present application, after the sending a synthesis request to a cloud according to the voice broadcast request, the method further includes:
and if the target voice data returned by the cloud terminal is received in the first time period, broadcasting the voice data synthesized by the cloud terminal voice synthesis process.
As an optional implementation manner, in the first aspect of the embodiment of the present application, after the voice data corresponding to the voice broadcast request is synthesized by the local voice synthesis process, the method further includes:
if target voice data returned by the cloud end is received in a second time period for running the local voice synthesis process, stopping running the local voice synthesis process, and broadcasting voice data synthesized by the cloud end voice synthesis process;
and if the target voice data returned by the cloud end is still not received in a second time period for running the local voice synthesis process, sending a termination request to the cloud end, and broadcasting the voice data synthesized by the local voice synthesis process, wherein the termination request is used for indicating the cloud end to terminate running the cloud end voice synthesis process.
As an optional implementation manner, in the first aspect of the embodiment of the present application, if target speech data returned by the cloud is still not received within a second time period in which the local speech synthesis process is running, sending a termination request to the cloud includes:
and if the target voice data returned by the cloud end is still not received in a second time period for running the local voice synthesis process, and the first frame voice data synthesized by the local voice synthesis process is received in the second time period, sending a termination request to the cloud end.
As an optional implementation manner, in the first aspect of this embodiment of the present application, the method further includes:
if the target voice data returned by the cloud end is still not received in a second time period for running the local voice synthesis process, and the first frame voice data synthesized by the local voice synthesis process is not received in the second time period, continuing waiting;
if target voice data returned by the cloud end is received firstly, terminating the local voice synthesis process, and broadcasting voice data synthesized by the cloud end voice synthesis process;
if the first frame of voice data synthesized by the local voice synthesis process is received at first, sending a termination request to the cloud end, and broadcasting the voice data synthesized by the local voice synthesis process, wherein the termination request is used for indicating the cloud end to terminate the operation of the cloud end voice synthesis process.
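The arbitration described in the clauses above amounts to a first-result-wins race: after the local process starts, whichever arrives first, the cloud's target voice data or the local first frame, is broadcast, and the losing side is terminated. The following is an illustrative Python sketch of that race, not the patent's implementation; `cloud_fetch` and `local_first_frame` are hypothetical stand-ins for the cloud result and the local first-frame callback.

```python
import queue
import threading

def race_cloud_vs_local(cloud_fetch, local_first_frame):
    """First-result-wins arbitration (illustrative sketch).

    The cloud fetch and the local first-frame synthesis run concurrently;
    the caller broadcasts whichever result arrives first. If neither has
    arrived yet, the blocking get() keeps waiting, matching the
    "continue waiting" step in the method above.
    """
    winner_q = queue.Queue()

    def run(tag, fn):
        winner_q.put((tag, fn()))

    threading.Thread(target=run, args=("cloud", cloud_fetch), daemon=True).start()
    threading.Thread(target=run, args=("local", local_first_frame), daemon=True).start()

    tag, data = winner_q.get()  # blocks until the first result arrives
    # In the real system, the losing side would now be stopped: a
    # termination request to the cloud, or stopping the local process.
    return tag, data
```

The queue serializes the two results, so no lock is needed to decide the winner; the slower thread's result is simply left unread.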
As an optional implementation manner, in the first aspect of the embodiment of the present application, after the receiving the voice broadcast request, the method further includes:
acquiring a target fusion strategy of a target text, wherein the target text is any text in text contents corresponding to the voice broadcast request;
if the target voice data returned by the cloud end is not received in the first time period, running a local voice synthesis process, including:
and if the target voice data corresponding to the fusion strategy returned by the cloud is not received in the first time period corresponding to the target fusion strategy, running a local voice synthesis process.
As an optional implementation manner, in a first aspect of an embodiment of the present application, the obtaining a target fusion policy of a target text includes:
acquiring a strategy identifier from the voice broadcast request;
and determining the fusion strategy indicated by the strategy identification as a target fusion strategy of the target text.
As an optional implementation manner, in a first aspect of an embodiment of the present application, the obtaining a target fusion policy of a target text includes:
acquiring a service type corresponding to the voice broadcast request;
and determining the fusion strategy matched with the service type as a target fusion strategy of the target text.
As an optional implementation manner, in a first aspect of an embodiment of the present application, the obtaining a target fusion policy of a target text includes:
performing text feature matching on the target text by using a preset text feature matching library to obtain text features of the target text;
and determining the fusion strategy matched with the text features as a target fusion strategy of the target text.
As an optional implementation manner, in the first aspect of the embodiment of the present application, the target fusion policy includes a delay-first fusion policy, and the target speech data corresponding to the delay-first fusion policy includes the first frame speech data corresponding to the target text.
As an optional implementation manner, in the first aspect of the embodiment of the present application, the target fusion policy includes a data-first fusion policy, and the target voice data corresponding to the data-first fusion policy includes all voice data corresponding to the target text.
A second aspect of the embodiments of the present application provides a terminal device, including:
the acquisition unit is used for receiving a voice broadcast request;
the processing unit is used for sending a synthesis request to a cloud according to the voice broadcast request, wherein the synthesis request is used for indicating the cloud to run a cloud voice synthesis process, and synthesizing voice data corresponding to the voice broadcast request through the cloud voice synthesis process;
the processing unit is further configured to run a local voice synthesis process if the target voice data returned by the cloud is not received within a first time period, and synthesize the voice data corresponding to the voice broadcast request through the local voice synthesis process.
A third aspect of the embodiments of the present application provides a terminal device, which may include:
a memory storing executable program code;
and a processor coupled to the memory;
the processor calls the executable program code stored in the memory, and when executed by the processor, the executable program code causes the processor to implement the method according to the first aspect of the embodiments of the present application.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, on which executable program code is stored, and when the executable program code is executed by a processor, the method according to the first aspect of embodiments of the present application is implemented.
A fifth aspect of embodiments of the present application discloses a computer program product, which, when run on a computer, causes the computer to perform any one of the methods disclosed in the first aspect of embodiments of the present application.
A sixth aspect of the present embodiment discloses an application publishing platform, configured to publish a computer program product, where when the computer program product runs on a computer, the computer is caused to execute any of the methods disclosed in the first aspect of the present embodiment.
According to the technical scheme, the embodiment of the application has the following advantages:
in the embodiment of the application, a voice broadcast request is first received, and a synthesis request is then sent to the cloud according to it; the synthesis request instructs the cloud to run a cloud voice synthesis process and synthesize the voice data corresponding to the voice broadcast request. If target voice data returned by the cloud is not received within a first time period, a local voice synthesis process is run, and the voice data corresponding to the voice broadcast request is synthesized locally. Performing voice data synthesis preferentially through the cloud improves the sound quality of the voice data. In addition, failing to receive the target voice data within the first time period indicates that the cloud's current data processing conditions are poor, so starting the local voice synthesis process avoids the low processing efficiency that would result from the cloud not feeding back the target voice data in time. By combining the cloud and local voice data synthesis modes, the method balances voice data quality against processing efficiency, which is favorable for optimizing the text-to-speech processing effect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and a person skilled in the art can derive other drawings from them without creative effort.
Fig. 1A is a schematic view of a scene of a speech data synthesis method disclosed in an embodiment of the present application;
FIG. 1B is a timing diagram of a method for synthesizing speech data according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a speech data synthesis method disclosed in an embodiment of the present application;
FIG. 3 is a flow chart of another speech data synthesis method disclosed in the embodiments of the present application;
FIG. 4 is a flow chart illustrating a further method for synthesizing speech data according to the embodiment of the present application;
fig. 5 is a schematic flowchart of a speech synthesis method corresponding to the delay-first fusion policy disclosed in the embodiment of the present application;
fig. 6 is a schematic flowchart of a speech synthesis method corresponding to a data-first fusion policy disclosed in an embodiment of the present application;
fig. 7 is a block diagram of a terminal device according to an embodiment of the present disclosure;
fig. 8 is another block diagram of the terminal device disclosed in the embodiment of the present application;
fig. 9 is a block diagram of another structure of a terminal device disclosed in the embodiment of the present application.
Detailed Description
The embodiment of the application provides a voice data synthesis method, terminal equipment and a computer readable storage medium, which can optimize the processing effect of converting characters into voice.
For a person skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some embodiments of the present application, not all of them; all other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
It is understood that the terminal device according to the embodiment of the present application may include a general electronic terminal device, such as a mobile phone, a smart phone, a portable terminal, a Personal Digital Assistant (PDA), a Portable Multimedia Player (PMP) device, a notebook computer, a note pad, a Wireless Broadband (WiBro) terminal, a tablet Personal Computer (PC), a smart PC, a Point of Sale (POS) terminal, a vehicle-mounted terminal, and the like.
The terminal device may also comprise a wearable device. The wearable device may be worn directly on the user, or may be a portable electronic device integrated into the user's clothing or an accessory. A wearable device is not merely a piece of hardware: through software support, data interaction, and cloud interaction, it can provide powerful intelligent functions, for example computation, positioning, and alerting, and it can connect to mobile phones and various other terminals. Wearable devices may include, but are not limited to, wrist-supported watch types (e.g., wrist watches and wristband products), foot-supported shoe types (e.g., shoes, socks, or other leg-worn products), head-supported glasses types (e.g., glasses, helmets, headbands), and various other products such as smart clothing, bags, crutches, and accessories.
The technical solution of the present application is further described below by way of examples.
Referring to fig. 1A, fig. 1A is a scene schematic diagram of a speech data synthesis method disclosed in an embodiment of the present application. As shown in fig. 1A, the cloud 200 may refer to a cloud server that provides simple, efficient, secure, reliable, and powerful computing services, and the cloud 200 may include one or more cloud servers.
Referring to fig. 1B, fig. 1B is a timing diagram of a speech data synthesis method according to an embodiment of the present application. When receiving a voice broadcast request sent by any installed application, the terminal device 100 may send a synthesis request to the cloud 200 according to the voice broadcast request, and the cloud 200 may run a cloud voice synthesis process when receiving the synthesis request to synthesize voice data corresponding to the voice broadcast request and send the synthesized voice data to the terminal device 100. If the terminal device 100 does not receive the target voice data returned by the cloud 200 within the first time period, the terminal device 100 runs a local voice synthesis process to synthesize the voice data corresponding to the voice broadcast request. The terminal device 100 preferentially obtains the voice data through the cloud 200, and if the voice data returned by the cloud 200 is not obtained in time, the local voice synthesis process is started, so that the terminal device 100 can give consideration to the quality and the processing efficiency of the voice data when acquiring the voice data, and the processing effect of converting the text into the voice of the terminal device 100 is favorably optimized.
Referring to fig. 2, fig. 2 is a schematic flow chart of a speech synthesis method according to an embodiment of the present application. The method is applicable to the terminal device, and the method can comprise the following steps:
201. Receive a voice broadcast request.
In some embodiments, the terminal device may receive the voice broadcast request through a voice broadcast request module, where the voice broadcast request module may be a functional service module of the terminal device. It should be noted that the function service module can be called by an application running on the terminal device. Specifically, the user can operate the application program through operation modes such as voice, gestures or touch, and when detecting the user operation, the application program sends a voice broadcast request to the voice broadcast request module. By way of example, the application may include, but is not limited to, a navigation-type application, a learning-type application, a messaging-type application, and the like. If the application program is a learning application program, the voice broadcast request may include indication information for indicating to broadcast voice data corresponding to the content to be learned. If the application program is a navigation application program, the voice broadcast request may include indication information for indicating voice data corresponding to navigation information, and the navigation information may include start location information, destination location information, a navigation route, and the like; if the application program is a communication application, the voice broadcast request may include indication information for indicating to broadcast voice data corresponding to the text communication content.
202. Send a synthesis request to the cloud according to the voice broadcast request, wherein the synthesis request is used for instructing the cloud to run a cloud voice synthesis process and synthesize the voice data corresponding to the voice broadcast request through the cloud voice synthesis process.
In some embodiments, the synthesis request may include one or more pieces of text content corresponding to the indication information of the voice data, and the synthesis request may be used to instruct the cloud to synthesize the voice data corresponding to the text content through the cloud voice synthesis process. For example, if the application is a navigation application and the user asks by voice "how to drive from location A to location B", the voice broadcast request sent by the application to the voice broadcast request module may include the text content "go straight for 7 km along the c-segment road, turn left, enter the d-segment road, continue going straight for 8 km", and the synthesis request may include "go straight for 7 km along the c-segment road".
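As a toy illustration of the payload shaping just described, the sketch below splits the broadcast text into per-clause synthesis requests. The field names and the comma-based splitting are assumptions for illustration only; the patent only says that the synthesis request carries one or more pieces of text content.

```python
def build_synthesis_requests(broadcast_request):
    """Split broadcast text into per-clause synthesis requests (sketch).

    broadcast_request is assumed to be a dict with a "text" field;
    each returned dict stands in for one synthesis request sent to
    the cloud.
    """
    text = broadcast_request["text"]
    segments = [s.strip() for s in text.split(",") if s.strip()]
    return [{"text": seg} for seg in segments]
```

With the navigation example above, the first synthesis request would carry just the first clause of the route description.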
In this embodiment of the application, when the cloud receives one or more pieces of text content, the cloud may perform speech synthesis on them using a conversion engine running in the cloud. It should be noted that the conversion model used by the cloud conversion engine can be very complex, so the voice data obtained by the cloud voice synthesis process has high sound quality.
203. If the target voice data returned by the cloud is not received within the first time period, run a local voice synthesis process, and synthesize the voice data corresponding to the voice broadcast request through the local voice synthesis process.
In this embodiment, the terminal device may preset a first time period, where the first time period may refer to a start delay time period of offline synthesis. The target speech data may include target frame speech data corresponding to the text content. For example, the target frame speech data may be the first frame speech data, the first 3 frames speech data, the first 10 frames speech data, or the entire frame speech data corresponding to the text content included in the synthesis request. Optionally, the target frame voice data may be set according to actual requirements, or may be intelligently set by the terminal device according to network conditions, which is not limited in the embodiment of the present application.
In some embodiments, the size of the first time period may be in positive correlation with the number of frames included in the target voice data, and it is understood that the larger the number of frames included in the target voice data is, the longer the first time period is, and vice versa, the shorter the first time period is. It can be understood that, if the terminal device does not receive the target voice data in the first time period, it indicates that the current data processing situation of the cloud is poor, for example, the network situation may be poor, or the load of the cloud is large, etc. may cause a situation that the voice data processing efficiency is low.
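The cloud-first behavior of steps 202 and 203 can be sketched as follows. This is an illustrative Python sketch under assumed names; `cloud_synthesize`, `local_synthesize`, and the default timeout value are not taken from the patent.

```python
import queue
import threading

def synthesize_with_fallback(text, cloud_synthesize, local_synthesize,
                             first_period_s=0.3):
    """Prefer cloud synthesis; fall back to local after the first period.

    cloud_synthesize runs in a background thread so that waiting for its
    target data can be bounded by the first time period (the start-delay
    of offline synthesis). first_period_s=0.3 is an illustrative value.
    """
    result_q = queue.Queue()
    threading.Thread(
        target=lambda: result_q.put(("cloud", cloud_synthesize(text))),
        daemon=True,
    ).start()
    try:
        # Wait at most the first time period for the cloud's target data.
        return result_q.get(timeout=first_period_s)
    except queue.Empty:
        # Cloud did not answer in time: run the local voice synthesis process.
        return ("local", local_synthesize(text))
```

A longer first period trades latency for a better chance of getting the higher-quality cloud audio, which matches the positive correlation with the target frame count noted above.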
In some embodiments, if the terminal device receives the target voice data returned by the cloud within the first time period, it broadcasts the voice data synthesized by the cloud voice synthesis process.
By implementing the method, the terminal equipment preferentially carries out voice data synthesis through the cloud end, so that the improvement of the tone quality of the voice data is facilitated; in addition, if the target voice data returned by the cloud end is not received in the first time period, which indicates that the current data processing condition of the cloud end is poor, a local voice synthesis process is started, so that the problem of low processing efficiency caused by the fact that the cloud end does not feed back the target voice data in time can be solved. Therefore, the cloud and the local voice data synthesis mode are combined, the quality and the processing efficiency of the voice data are considered, and the processing effect of converting characters into voice is optimized.
Referring to fig. 3, fig. 3 is a schematic flowchart illustrating another speech data synthesis method according to an embodiment of the present application. The method may include the following steps:
301. Receive a voice broadcast request.
302. Acquire a target fusion strategy of a target text, wherein the target text is any text in the text content corresponding to the voice broadcast request.
In some embodiments, the target fusion policy may include a delay-first fusion policy or a data-first fusion policy. The delay-first fusion strategy is a strategy in which broadcast delay is used as priority, and is directed to a delay-sensitive TTS service, which indicates a service with a high requirement on data delay, such as a man-machine conversation (a weather query service, a date query service, a new word query service, and the like). The data priority fusion strategy is a strategy taking broadcast data integrity as priority, and aims at data integrity sensitive TTS service which indicates a service with higher requirement on data integrity, such as navigation service.
In some embodiments, the manner in which the terminal device obtains the target fusion policy of the target text may include, but is not limited to, the following manners:
in the mode 1, the terminal device obtains the policy identifier from the voice broadcast request, and determines the fusion policy indicated by the policy identifier as the target fusion policy of the target text.
In this embodiment of the application, the voice broadcast request may include a target text and a policy identifier indicating the target fusion policy. Optionally, the policy identifier may include one or a combination of numbers, text, or special characters.
In some embodiments, when detecting a user operation, an application program may obtain a target text based on the user operation, autonomously set a target fusion policy of the target text, generate a voice broadcast request according to the target text and the target fusion policy, and finally send the voice broadcast request to a voice broadcast request module. It can be understood that, if the application is a weather query application, and when an operation of querying weather by a user is detected, the application obtains a target text indicating weather, and sets a fusion policy of the target text as a delay-first fusion policy, the voice broadcast request generated by the weather query application may include the target text indicating weather and a policy identifier corresponding to the delay-first fusion policy. By implementing the method, the application program can autonomously set the fusion strategy based on the self service requirement, so that the synthesis of the voice data can be more fit with the actual requirement.
Mode 2: the terminal device acquires the service type corresponding to the voice broadcast request, and determines the fusion strategy matched with the service type as the target fusion strategy of the target text.
If the terminal device receives the voice broadcast request through the voice broadcast request module, the service type corresponding to the voice broadcast request is the service type corresponding to the application program sending the voice broadcast request. In some embodiments, the traffic type may include delay sensitive TTS traffic or data integrity sensitive TTS traffic. For example, if the application program is a weather query application program, the service type corresponding to the voice broadcast request is a delay sensitive TTS service, and if the application program is a navigation application program, the service type corresponding to the voice broadcast request is a data integrity sensitive TTS service.
In some embodiments, the terminal device may store a first fusion policy base in advance, where the first fusion policy base may include a plurality of fusion policies and an application program corresponding to each fusion policy. In some embodiments, the voice broadcast request may include an application identifier of an application program, and the terminal device may identify the application program through the application identifier in the voice broadcast request, and further search the fusion policy corresponding to the application program in the first fusion policy library. In some embodiments, the application identification may include any one or more of numbers, text, and special characters.
By implementing the method, the terminal equipment can realize the intelligent determination of the fusion strategy based on the service type, and the intelligent degree of the terminal equipment for synthesizing the voice data is favorably improved.
Mode 3: the terminal device performs text feature matching on the target text by using a preset text feature matching library to obtain text features of the target text, and determines the fusion strategy matched with the text features as the target fusion strategy of the target text.
In embodiments of the present application, the text features may include delay sensitive text, data sensitive text, and the like.
In some embodiments, the terminal device may be preset with a second fusion policy library, where the second fusion policy library records multiple preset text features and a fusion policy corresponding to each preset text feature. Optionally, the terminal device may identify a key text in the target text, extract text features of the target text from the key text, match the text features of the target text with each preset text feature recorded in the second fusion policy library, determine the preset text features matched with the text features of the target text, and finally determine the target fusion policy of the target text according to the matched preset text features.
As another embodiment, the terminal device may perform character recognition on the target text, obtain a service type based on the character recognition result, and finally determine the target fusion policy of the target text according to that service type.
In practice, it is found that for a data-integrity-sensitive TTS service (e.g., navigation), most texts are data-sensitive texts, that is, texts indicating a route; however, a small number are delay-sensitive texts, that is, texts indicating driving safety (e.g., "the vehicle ahead is congested, please slow down and drive carefully"). Therefore, determining the target fusion policy based on the text features of the target text further refines the analysis granularity of the fusion policy, so that the synthesis of voice data better meets actual requirements.
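Mode 3 can be sketched as follows, assuming a keyword-based text feature matching library and a second fusion policy library; the keywords, feature names, and policy names are invented for illustration.

```python
# Hypothetical second fusion policy library mapping preset text features to
# fusion policies, plus a toy text feature matcher over key text.

SECOND_FUSION_POLICY_BASE = {
    "delay_sensitive": "delay_first",  # e.g. texts indicating driving safety
    "data_sensitive": "data_first",    # e.g. texts indicating a route
}

DELAY_SENSITIVE_KEYWORDS = ("congested", "slow down", "caution")

def match_text_feature(target_text):
    """Extract a coarse text feature from key text in the target text."""
    text = target_text.lower()
    if any(keyword in text for keyword in DELAY_SENSITIVE_KEYWORDS):
        return "delay_sensitive"
    return "data_sensitive"

def target_fusion_policy(target_text):
    """Map the matched text feature to its fusion policy."""
    return SECOND_FUSION_POLICY_BASE[match_text_feature(target_text)]
```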
303. And sending a synthesis request to the cloud according to the voice broadcast request, wherein the synthesis request is used for indicating the cloud to run a cloud voice synthesis process, and synthesizing voice data corresponding to the voice broadcast request through the cloud voice synthesis process.
For the description of step 301 and step 303, please refer to the description of step 201 and step 202 in the method shown in fig. 2; details are not described herein again.
304. And if the target voice data corresponding to the target fusion strategy returned by the cloud is not received in the first time period corresponding to the target fusion strategy, running a local voice synthesis process, and synthesizing the voice data corresponding to the voice broadcast request through the local voice synthesis process.
Different fusion policies may specify different target voice data and different first time periods according to their corresponding business requirements. In this embodiment of the application, if the target fusion policy is a delay-first fusion policy, the target voice data corresponding to the delay-first fusion policy may include the first frame of voice data corresponding to the target text; if the target fusion policy is a data-first fusion policy, the target voice data corresponding to the data-first fusion policy may include all voice data corresponding to the target text.
In the embodiment of the application, because the delay-first fusion policy prioritizes broadcast delay, it places more emphasis on the efficiency of acquiring voice data; therefore, its waiting time for voice data returned by the cloud is generally short. The data-first fusion policy prioritizes the integrity of the broadcast data and places more emphasis on the completeness of the voice data, so its waiting time for voice data returned by the cloud is generally longer. Thus, in some embodiments, the first time period corresponding to the delay-first fusion policy may be shorter than the first time period corresponding to the data-first fusion policy.
In some embodiments, if the target voice data corresponding to the target fusion policy is not received from the cloud within the first time period, the terminal device may send a termination request to the cloud. The termination request instructs the cloud to terminate the cloud voice synthesis process; that is, when the local voice synthesis process is started, the cloud voice synthesis process is cancelled, and subsequently broadcast voice data is obtained through the local voice synthesis process.
In other embodiments, if the target voice data corresponding to the target fusion policy is not received from the cloud within the first time period, the cloud voice synthesis process may continue to run while the local voice synthesis process runs, and subsequently broadcast voice data is obtained through either the local voice synthesis process or the cloud voice synthesis process.
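The first-time-period fallback described above can be sketched as follows, under assumed primitives: the cloud result arrives on a queue, and the timeout values, function names, and the choice of cancelling the cloud process are illustrative assumptions.

```python
import queue

# Assumed policy-dependent first time periods (seconds): the delay-first
# policy waits less for the cloud than the data-first policy does.
FIRST_TIME_PERIOD = {"delay_first": 0.3, "data_first": 1.5}

def synthesize_with_fallback(policy, cloud_results, run_local,
                             send_termination, cancel_cloud=True):
    """Wait for the cloud's target voice data; on timeout, fall back locally."""
    try:
        # Target data: first frame for delay-first, all frames for data-first.
        return cloud_results.get(timeout=FIRST_TIME_PERIOD[policy])
    except queue.Empty:
        if cancel_cloud:
            send_termination()  # one embodiment: cancel the cloud process
        return run_local()      # synthesize via the local process instead
```

Passing `cancel_cloud=False` corresponds to the embodiment in which the cloud process keeps running alongside the local one.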
By implementing the method, cloud and local voice data synthesis are combined, taking into account both the quality and the processing efficiency of the voice data, which helps optimize the effect of converting text into voice. In addition, the terminal device can determine a target fusion policy for the target text based on the voice broadcast request and then synthesize the voice data of the target text as indicated by the target fusion policy, improving the flexibility of voice data synthesis so that it better suits business requirements.
Referring to fig. 4, fig. 4 is a schematic flowchart illustrating another speech data synthesis method according to an embodiment of the present application. The method can comprise the following steps:
401. and receiving a voice broadcast request.
402. And acquiring a target fusion strategy of a target text, wherein the target text is any text in text contents corresponding to the voice broadcast request.
403. And sending a synthesis request to the cloud according to the voice broadcast request, wherein the synthesis request is used for indicating the cloud to run a cloud voice synthesis process, and synthesizing voice data corresponding to the voice broadcast request through the cloud voice synthesis process.
404. And judging whether target voice data corresponding to the target fusion strategy returned by the cloud is received in a first time period corresponding to the target fusion strategy, if so, executing the step 405, and if not, executing the steps 406 to 407.
405. And broadcasting the voice data synthesized by the cloud voice synthesis process.
406. And running a local voice synthesis process, and synthesizing voice data corresponding to the voice broadcast request through the local voice synthesis process.
For the description of steps 401 to 406, please refer to steps 301 to 304 shown in fig. 3, which is not described herein again.
407. And judging whether target voice data corresponding to the target fusion strategy returned by the cloud is received in a second time period for running the local voice synthesis process, if so, executing step 408, and if not, executing step 409.
It should be noted that, if the determination result in step 407 is negative, the terminal device may send a termination request to the cloud: optionally, the terminal device may send the termination request directly; alternatively, step 409 is performed first.
In this embodiment of the application, to improve the sound quality of the voice data as much as possible, a grace period, namely the second time period, may be set for the cloud voice synthesis process. The length of the second time period determines the proportion of the broadcast voice data that is derived from the cloud voice synthesis process: specifically, the longer the second time period, the larger the proportion of the broadcast voice data derived from the cloud voice synthesis process, and vice versa.
Optionally, the determining manner of the second time period may include, but is not limited to, the following manners:
(1) The terminal device determines the second time period through the module that issued the voice broadcast request;
(2) The terminal device determines the second time period according to the service type of the voice broadcast request module. Illustratively, if the service type is a delay-sensitive TTS service, the second time period is shorter; if it is a data-integrity-sensitive TTS service, the second time period is longer.
(3) The terminal device determines the second time period according to the target text. Optionally, the determination may be based on the complexity and/or the text length of the target text, where the complexity may be related to the number of text types (e.g., Chinese or English) included in the target text. The greater the complexity of the target text and the longer the text length, the longer the second time period.
(4) The terminal device determines the second time period according to the network signal strength. Optionally, the stronger the network signal, the shorter the second time period; conversely, the longer the second time period.
(5) The terminal device determines the second time period according to its own location information. Specifically, if the terminal device is located in an area with strong network coverage, the second time period is shorter; conversely, it is longer.
(6) If the terminal device is in a moving state (e.g., a vehicle-mounted computer), it may determine the second time period according to the moving speed. Specifically, the higher the moving speed, the longer the second time period; conversely, the shorter. It can be understood that the faster the terminal device moves, the higher the probability of network degradation, so the cloud may need more time to return the voice data.
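Items (3) through (6) above could be combined into a single estimate along the following lines; the base value, weights, and clamping range are invented for illustration and do not come from this application.

```python
def second_time_period(target_text, signal_strength, moving_speed_kmh):
    """Estimate the grace period (seconds) for the cloud synthesis process.

    signal_strength is normalized to [0, 1]; all coefficients are illustrative.
    """
    # (3) longer and more complex text (more script types) -> longer period
    scripts = {"cjk" if ord(c) > 0x2E80 else "latin"
               for c in target_text if c.strip()}
    period = 1.0 + 0.01 * len(target_text) + 0.2 * (len(scripts) - 1)
    # (4)/(5) stronger network signal or a strong-network area -> shorter
    period *= 2.0 - signal_strength
    # (6) faster movement -> higher chance of network degradation, so the
    #     cloud may need more time to respond (a policy choice)
    period *= 1.0 + moving_speed_kmh / 200.0
    return max(0.2, min(period, 5.0))  # clamp to a sane range
```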
408. And stopping running the local voice synthesis process, and broadcasting voice data synthesized by the cloud voice synthesis process.
In the embodiment of the application, when the target fusion policy is a delay-first fusion policy, if the first frame of voice data corresponding to the target text is returned by the cloud within the second time period of running the local voice synthesis process, that first frame of voice data is broadcast. It can be understood that if the voice data corresponding to the target text includes n frames, the terminal device plays each frame of voice data as it is received from the cloud.
When the target fusion policy is a data-first fusion policy, if all voice data corresponding to the target text is returned by the cloud within the second time period of running the local voice synthesis process, all of the voice data is broadcast. It can be understood that if the voice data corresponding to the target text includes n frames, the terminal device plays the voice data only once all n frames have been received from the cloud.
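The two playback behaviors described above (frame-by-frame for delay-first, all-at-once for data-first) differ only in when broadcasting starts; a sketch, with `play` as a placeholder for the actual audio output:

```python
def broadcast(frames, policy, play):
    """frames: an iterable (possibly a generator) of received voice frames."""
    if policy == "delay_first":
        for frame in frames:
            play(frame)             # broadcast each frame as soon as it arrives
    elif policy == "data_first":
        for frame in list(frames):  # wait until all n frames have arrived
            play(frame)
```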
409. It is determined whether the first frame of voice data synthesized by the local voice synthesis process is received within the second time period; if yes, step 410 is executed, and if not, step 411 or step 412 is executed.
410. And sending a termination request to a cloud end, and broadcasting voice data synthesized by a local voice synthesis process, wherein the termination request is used for indicating the cloud end to terminate the operation of the cloud end voice synthesis process.
In the embodiment of the application, if the target fusion policy is a delay-first fusion policy, the terminal device broadcasts each frame of voice data corresponding to the target text as it is obtained through the local voice synthesis process. If the target fusion policy is a data-first fusion policy, the terminal device either broadcasts each frame of voice data as it is obtained through the local voice synthesis process, or broadcasts all of the voice data corresponding to the target text once all of it has been obtained through the local voice synthesis process.
411. And if the target voice data returned by the cloud is received first, terminating the local voice synthesis process and broadcasting the voice data synthesized by the cloud voice synthesis process.
412. And if the first frame of voice data synthesized by the local voice synthesis process is received first, sending a termination request to the cloud and broadcasting the voice data synthesized by the local voice synthesis process.
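Steps 409 through 412 amount to a race between the two synthesis processes once the second time period has elapsed with neither result in hand. A sketch under assumed primitives (both processes push onto one queue; the tuple format and callback names are invented):

```python
import queue

def race_sources(results, send_termination, stop_local):
    """results receives ("cloud", data) or ("local", data); play the winner."""
    source, data = results.get()  # keep waiting for whichever arrives first
    if source == "cloud":
        stop_local()              # step 411: terminate the local process
    else:
        send_termination()        # step 412: ask the cloud to terminate
    return source, data
```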
Referring to fig. 5, fig. 5 is a flowchart illustrating a speech synthesis method corresponding to a delay-first fusion policy according to an embodiment of the present application.
Referring to fig. 6, fig. 6 is a schematic flowchart of a speech synthesis method corresponding to a data-first fusion policy disclosed in an embodiment of the present application. By implementing this method, cloud and local voice data synthesis are combined, taking into account both the quality and the processing efficiency of the voice data, which helps optimize the effect of converting text into voice. In addition, the terminal device can determine a target fusion policy for the target text based on the voice broadcast request and then synthesize the voice data of the target text as indicated by the target fusion policy, improving the flexibility of voice data synthesis so that it better suits business requirements. Furthermore, by setting a grace period (the second time period) for the cloud voice synthesis process, the proportion of broadcast voice data derived from the cloud voice synthesis process can be increased, further improving the sound quality of the voice data.
Referring to fig. 7, fig. 7 is a block diagram of a terminal device according to an embodiment of the present disclosure, and may include an obtaining unit 701 and a processing unit 702. Wherein:
an obtaining unit 701, configured to receive a voice broadcast request;
the processing unit 702 is configured to send a synthesis request to the cloud according to the voice broadcast request, where the synthesis request is used to instruct the cloud to run a cloud voice synthesis process, and synthesize voice data corresponding to the voice broadcast request through the cloud voice synthesis process;
the processing unit 702 is further configured to run a local speech synthesis process if the target speech data returned by the cloud is not received in the first time period, and synthesize the speech data corresponding to the speech broadcast request through the local speech synthesis process.
Referring to fig. 8, fig. 8 is a block diagram of another structure of a terminal device disclosed in the embodiment of the present application, which may include an obtaining unit 701, a processing unit 702, and a playing unit 703. Wherein:
For the obtaining unit 701 and the processing unit 702, please refer to the description of the terminal device shown in fig. 7; details are not described herein again. The playing unit 703 is configured to broadcast the voice data synthesized by the cloud voice synthesis process if the target voice data returned by the cloud is received within the first time period.
In some implementations, the playing unit 703 is further configured to terminate the running of the local speech synthesis process and broadcast the speech data synthesized by the cloud speech synthesis process if the target speech data returned by the cloud is received within the second time period in which the local speech synthesis process is running; and if the target voice data returned by the cloud is not received in the second time period of running the local voice synthesis process, sending a termination request to the cloud, and broadcasting the voice data synthesized by the local voice synthesis process, wherein the termination request is used for indicating the cloud to terminate running the cloud voice synthesis process.
In some embodiments, the playing unit 703 is specifically configured to send a termination request to the cloud if the target speech data returned by the cloud is not received in the second time period during which the local speech synthesis process is running, and the first frame of speech data synthesized by the local speech synthesis process is received in the second time period.
In some embodiments, the playing unit 703 is further configured to continue to wait if the target speech data returned by the cloud is still not received in the second time period in which the local speech synthesis process is running, and the first frame of speech data synthesized by the local speech synthesis process is not received in the second time period; if target voice data returned by the cloud end is received at first, terminating the local voice synthesis process, and broadcasting the voice data synthesized by the cloud end voice synthesis process; and if the first frame of voice data synthesized by the local voice synthesis process is received at first, sending a termination request to the cloud end, and broadcasting the voice data synthesized by the local voice synthesis process, wherein the termination request is used for indicating the cloud end to terminate the operation of the cloud end voice synthesis process.
In some embodiments, the obtaining unit 701 is further configured to obtain a target fusion policy of a target text after receiving the voice broadcast request, where the target text is any text in text content corresponding to the voice broadcast request.
Further, the processing unit 702 is specifically configured to run a local speech synthesis process if the target speech data corresponding to the target fusion policy and returned by the cloud is not received within the first time period corresponding to the target fusion policy.
In some embodiments, the manner for the obtaining unit 701 to obtain the target fusion policy of the target text may include, but is not limited to, the following manners:
(1) the obtaining unit 701 is specifically configured to obtain a policy identifier from the voice broadcast request, and determine a fusion policy indicated by the policy identifier as a target fusion policy of the target text.
(2) The obtaining unit 701 is specifically configured to obtain a service type corresponding to the voice broadcast request, and determine a fusion policy matched with the service type as a target fusion policy of the target text.
(3) The obtaining unit 701 is specifically configured to perform text feature matching on the target text by using a preset text feature matching library to obtain a text feature of the target text, and determine a fusion policy matched with the text feature as a target fusion policy of the target text.
In some embodiments, the target fusion policy may include a delay-first fusion policy, and the target speech data corresponding to the delay-first fusion policy includes first frame speech data corresponding to the target text.
In some embodiments, the target fusion policy may include a data-first fusion policy, and the target voice data corresponding to the data-first fusion policy includes all voice data corresponding to the target text.
Referring to fig. 9, fig. 9 is a block diagram of another structure of a terminal device disclosed in the embodiment of the present application. The method comprises the following steps: a processor 901 and a memory 902.
Among other things, the processor 901 has the following functions:
receiving a voice broadcast request;
sending a synthesis request to the cloud according to the voice broadcast request, wherein the synthesis request is used for indicating the cloud to operate a cloud voice synthesis process, and synthesizing voice data corresponding to the voice broadcast request through the cloud voice synthesis process;
and if the target voice data returned by the cloud end is not received in the first time period, running a local voice synthesis process, and synthesizing the voice data corresponding to the voice broadcast request through the local voice synthesis process.
In some embodiments, the processor 901 further has the following functions:
and if the target voice data returned by the cloud terminal is received in the first time period, broadcasting the voice data synthesized by the cloud terminal voice synthesis process.
In some embodiments, the processor 901 further has the following functions:
if target voice data returned by the cloud terminal are received in a second time period for running the local voice synthesis process, the running of the local voice synthesis process is stopped, and voice data synthesized by the cloud terminal voice synthesis process is broadcasted;
and if the target voice data returned by the cloud end is not received in the second time period of running the local voice synthesis process, sending a termination request to the cloud end, broadcasting the voice data synthesized by the local voice synthesis process, wherein the termination request is used for indicating the cloud end to terminate running the cloud end voice synthesis process.
In some embodiments, the processor 901 further has the following functions:
and if the target voice data returned by the cloud end is still not received in the second time period for running the local voice synthesis process, and the first frame voice data synthesized by the local voice synthesis process is received in the second time period, sending a termination request to the cloud end.
In some embodiments, the processor 901 further has the following functions:
if the target voice data returned by the cloud end is still not received in a second time period for running the local voice synthesis process, and the first frame voice data synthesized by the local voice synthesis process is not received in the second time period, continuing waiting;
if target voice data returned by the cloud end is received firstly, terminating the local voice synthesis process, and broadcasting voice data synthesized by the cloud end voice synthesis process;
if the first frame of voice data synthesized by the local voice synthesis process is received at first, sending a termination request to the cloud end, and broadcasting the voice data synthesized by the local voice synthesis process, wherein the termination request is used for indicating the cloud end to terminate the operation of the cloud end voice synthesis process.
In some embodiments, the processor 901 further has the following functions:
after receiving the voice broadcast request, acquiring a target fusion strategy of a target text, wherein the target text is any text in text contents corresponding to the voice broadcast request.
And if the target voice data corresponding to the target fusion strategy returned by the cloud is not received in the first time period corresponding to the target fusion strategy, running a local voice synthesis process.
In some embodiments, the processor 901 further has the following functions:
and acquiring a strategy identifier from the voice broadcast request, and determining the fusion strategy indicated by the strategy identifier as a target fusion strategy of the target text.
In some embodiments, the processor 901 further has the following functions:
and acquiring a service type corresponding to the voice broadcast request, and determining a fusion strategy matched with the service type as a target fusion strategy of the target text.
In some embodiments, the processor 901 further has the following functions:
and performing text feature matching on the target text by using a preset text feature matching library to obtain the text feature of the target text, and determining a fusion strategy matched with the text feature as the target fusion strategy of the target text.
In some embodiments, the target fusion policy may include a delay-first fusion policy, and the target speech data corresponding to the delay-first fusion policy includes first frame speech data corresponding to the target text.
In some embodiments, the target fusion policy may include a data-first fusion policy, and the target voice data corresponding to the data-first fusion policy includes all voice data corresponding to the target text.
The memory 902 has the following functions:
the processing procedure and the processing result of the processor 901 are stored.
An embodiment of the application discloses a computer-readable storage medium storing a computer program, which when executed by a processor implements any one of the above-described method embodiments.
Embodiments of the present application disclose a computer program product comprising a non-transitory computer readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform any of the above method embodiments.
The embodiment of the application discloses an application publishing platform, wherein the application publishing platform is used for publishing a computer program product, and when the computer program product runs on a computer, the computer is enabled to execute any method in the method embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (14)

1. A speech data synthesis method, comprising:
receiving a voice broadcast request;
sending a synthesis request to a cloud end according to the voice broadcast request, wherein the synthesis request is used for indicating the cloud end to operate a cloud end voice synthesis process, and synthesizing voice data corresponding to the voice broadcast request through the cloud end voice synthesis process;
and if the target voice data returned by the cloud end is not received in the first time period, running a local voice synthesis process, and synthesizing the voice data corresponding to the voice broadcast request through the local voice synthesis process.
2. The method of claim 1, wherein after sending a synthesis request to a cloud according to the voice broadcast request, the method further comprises:
and if the target voice data returned by the cloud terminal is received in the first time period, broadcasting the voice data synthesized by the cloud terminal voice synthesis process.
3. The method according to claim 1, wherein after the synthesizing of the voice data corresponding to the voice broadcast request by the local voice synthesis process, the method further comprises:
if target voice data returned by the cloud end is received in a second time period for running the local voice synthesis process, stopping running the local voice synthesis process, and broadcasting voice data synthesized by the cloud end voice synthesis process;
and if the target voice data returned by the cloud end is still not received in a second time period for running the local voice synthesis process, sending a termination request to the cloud end, and broadcasting the voice data synthesized by the local voice synthesis process, wherein the termination request is used for indicating the cloud end to terminate running the cloud end voice synthesis process.
4. The method of claim 3, wherein sending a termination request to the cloud if the target speech data returned by the cloud is not received within a second time period in which the local speech synthesis process is running comprises:
and if the target voice data returned by the cloud end is still not received in a second time period for running the local voice synthesis process, and the first frame voice data synthesized by the local voice synthesis process is received in the second time period, sending a termination request to the cloud end.
5. The method of claim 3, further comprising:
if the target voice data returned by the cloud is still not received within the second time period during which the local voice synthesis process is running, and first-frame voice data synthesized by the local voice synthesis process is also not received within the second time period, continuing to wait;
if the target voice data returned by the cloud is received first, terminating the local voice synthesis process and broadcasting the voice data synthesized by the cloud voice synthesis process; and
if the first-frame voice data synthesized by the local voice synthesis process is received first, sending a termination request to the cloud and broadcasting the voice data synthesized by the local voice synthesis process, wherein the termination request instructs the cloud to terminate the cloud voice synthesis process.
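Claims 1 to 5 describe a cloud-first synthesis flow: wait for the cloud within a first time window, start a local voice synthesis process as a fallback, and broadcast whichever result arrives first. The following Python sketch illustrates only that control flow; the function names, the window lengths, and the simplified "first result wins" race (which omits the termination request sent to the losing side) are illustrative assumptions, not the claimed method itself:

```python
import queue
import threading

def synthesize(request, cloud_tts, local_tts, t1=0.5):
    """Cloud-first TTS with a local fallback (illustrative sketch).

    t1 is the first time window: if the cloud returns target voice data
    within it, that audio is broadcast and the local process never starts.
    Otherwise the local voice synthesis process is also started, and
    whichever source delivers voice data first wins.
    """
    results = queue.Queue()
    threading.Thread(
        target=lambda: results.put(("cloud", cloud_tts(request))),
        daemon=True,
    ).start()
    try:
        # First time window: wait on the cloud voice synthesis process only.
        return results.get(timeout=t1)
    except queue.Empty:
        pass
    # Window elapsed: also run the local voice synthesis process.
    threading.Thread(
        target=lambda: results.put(("local", local_tts(request))),
        daemon=True,
    ).start()
    # Broadcast whichever synthesis process produces voice data first.
    return results.get()
```

A fast cloud response is returned directly inside the first window; a stalled cloud lets the local result through after the window elapses.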
6. The method according to any one of claims 1 to 5, wherein after receiving the voice broadcast request, the method further comprises:
acquiring a target fusion policy of a target text, wherein the target text is any text in the text content corresponding to the voice broadcast request;
and wherein running the local voice synthesis process if the target voice data returned by the cloud is not received within the first time period comprises:
running the local voice synthesis process if the target voice data corresponding to the target fusion policy returned by the cloud is not received within the first time period corresponding to the target fusion policy.
7. The method of claim 6, wherein acquiring the target fusion policy of the target text comprises:
acquiring a policy identifier from the voice broadcast request; and
determining the fusion policy indicated by the policy identifier as the target fusion policy of the target text.
8. The method of claim 6, wherein acquiring the target fusion policy of the target text comprises:
acquiring a service type corresponding to the voice broadcast request; and
determining the fusion policy matching the service type as the target fusion policy of the target text.
9. The method of claim 6, wherein acquiring the target fusion policy of the target text comprises:
performing text feature matching on the target text using a preset text feature matching library to obtain a text feature of the target text; and
determining the fusion policy matching the text feature as the target fusion policy of the target text.
10. The method of claim 6, wherein the target fusion policy comprises a latency-first fusion policy, and wherein the target voice data corresponding to the latency-first fusion policy comprises the first-frame voice data corresponding to the target text.
11. The method of claim 6, wherein the target fusion policy comprises a data-first fusion policy, and wherein the target voice data corresponding to the data-first fusion policy comprises all voice data corresponding to the target text.
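Claims 6 to 11 parameterize the flow with a per-text fusion policy that fixes both the first time window and what counts as target voice data (the first frame for a latency-first policy, all voice data for a data-first policy). A hypothetical sketch of policy resolution follows; claims 7 to 9 give three alternative selection methods, and the ordering, the policy table values, the service-type name, and the final default below are all assumptions for illustration:

```python
# Hypothetical policy table: each fusion policy pairs a first-time-window
# length with what counts as "target voice data" under it.
LATENCY_FIRST = {"window_s": 0.3, "target": "first_frame"}    # cf. claim 10
DATA_FIRST = {"window_s": 1.0, "target": "all_voice_data"}    # cf. claim 11

POLICIES = {"latency": LATENCY_FIRST, "data": DATA_FIRST}

def pick_policy(request, text, feature_library):
    """Resolve the target fusion policy for one target text by trying the
    three alternatives of claims 7-9 in one plausible order."""
    if "policy_id" in request:                        # claim 7: explicit identifier
        return POLICIES[request["policy_id"]]
    if request.get("service_type") == "navigation":   # claim 8: service type
        return LATENCY_FIRST
    for feature, policy in feature_library:           # claim 9: text-feature match
        if feature in text:
            return policy
    return DATA_FIRST
```

For example, a broadcast request tagged with a navigation service type would resolve to the latency-first policy, so only the first frame need arrive within its (shorter) window before the local fallback starts.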
12. A terminal device, comprising:
an acquisition unit configured to receive a voice broadcast request; and
a processing unit configured to send a synthesis request to a cloud according to the voice broadcast request, wherein the synthesis request instructs the cloud to run a cloud voice synthesis process and to synthesize voice data corresponding to the voice broadcast request through the cloud voice synthesis process;
wherein the processing unit is further configured to run a local voice synthesis process if target voice data returned by the cloud is not received within a first time period, and to synthesize the voice data corresponding to the voice broadcast request through the local voice synthesis process.
13. A terminal device, comprising:
a memory storing executable program code; and
a processor coupled to the memory;
wherein the processor calls the executable program code stored in the memory, and the executable program code, when executed by the processor, causes the processor to implement the method of any one of claims 1 to 11.
14. A computer-readable storage medium having executable program code stored thereon, wherein the executable program code, when executed by a processor, implements the method of any one of claims 1 to 11.
CN202110553610.4A 2021-05-20 2021-05-20 Speech data synthesis method, terminal device and computer readable storage medium Active CN113299273B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110553610.4A CN113299273B (en) 2021-05-20 2021-05-20 Speech data synthesis method, terminal device and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN113299273A true CN113299273A (en) 2021-08-24
CN113299273B CN113299273B (en) 2024-03-08

Family

ID=77323244


Country Status (1)

Country Link
CN (1) CN113299273B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102568471A (en) * 2011-12-16 2012-07-11 安徽科大讯飞信息科技股份有限公司 Voice synthesis method, device and system
US20150073805A1 (en) * 2013-09-12 2015-03-12 At&T Intellectual Property I, L.P. System and method for distributed voice models across cloud and device for embedded text-to-speech
WO2016004074A1 (en) * 2014-07-02 2016-01-07 Bose Corporation Voice prompt generation combining native and remotely generated speech data
CN109273001A (en) * 2018-10-25 2019-01-25 珠海格力电器股份有限公司 A kind of voice broadcast method, device, computing device and storage medium
CN111681640A (en) * 2020-05-29 2020-09-18 北京百度网讯科技有限公司 Broadcast text determination method, device, equipment and medium
CN112133281A (en) * 2020-09-15 2020-12-25 北京百度网讯科技有限公司 Voice broadcasting method and device, electronic equipment and storage medium



Similar Documents

Publication Publication Date Title
EP4206952A1 (en) Interactive information processing method and apparatus, device and medium
EP3955153A1 (en) Method and apparatus for inserting information into online document
US11494376B2 (en) Data query method supporting natural language, open platform, and user terminal
CN112037792B (en) Voice recognition method and device, electronic equipment and storage medium
CN108932102B (en) Data processing method and device and mobile terminal
CN111177358B (en) Intention recognition method, server and storage medium
WO2021259300A1 (en) Sound effect adding method and apparatus, storage medium, and electronic device
CN109165292A (en) Data processing method, device and mobile terminal
CN104978045A (en) Chinese character input method and device
CN109389977B (en) Voice interaction method and device
CN108337357B (en) audio playing method and device
CN113823282A (en) Voice processing method, system and device
CN112259076A (en) Voice interaction method and device, electronic equipment and computer readable storage medium
US11854422B2 (en) Method and device for information interaction
CN113299273A (en) Voice data synthesis method, terminal device, and computer-readable storage medium
CN111578965A (en) Navigation broadcast information processing method and device, electronic equipment and storage medium
CN111404998A (en) Voice interaction method, first electronic device and readable storage medium
CN111105797A (en) Voice interaction method and device and electronic equipment
CN114299955B (en) Voice interaction method and device, electronic equipment and storage medium
CN109299359A (en) A kind of road condition query method, apparatus, terminal and storage medium
CN110931014A (en) Speech recognition method and device based on regular matching rule
CN113421565A (en) Search method, search device, electronic equipment and storage medium
CN113488033A (en) User passive voice interaction method, device, terminal, server and medium
CN114153312B (en) VPA control method, device, equipment, storage medium and program product
WO2023061231A1 (en) Information display method and apparatus, and electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211229

Address after: 510555 No.8 Songgang street, Cencun, Tianhe District, Guangzhou City, Guangdong Province

Applicant after: GUANGZHOU XIAOPENG MOTORS TECHNOLOGY Co.,Ltd.

Address before: 510000 Room 202, room 406, No. 1, Yichuang street, Zhongxin Guangzhou Knowledge City, Huangpu District, Guangzhou City, Guangdong Province

Applicant before: Guangzhou Xiaopeng smart charging technology Co.,Ltd.

GR01 Patent grant