CN112527235A - Voice playing method, device, equipment and storage medium - Google Patents

Voice playing method, device, equipment and storage medium

Info

Publication number
CN112527235A
CN112527235A (application CN202011511791.6A)
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011511791.6A
Other languages
Chinese (zh)
Inventor
Wang Kun
Ge Yongliang
Li Jiantao
Current Assignee
Apollo Zhilian Beijing Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011511791.6A
Publication of CN112527235A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G06F 3/165 Management of the audio stream, e.g. setting of volume, audio stream path
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F 9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The application discloses a voice playing method, apparatus, device, and storage medium, relating to voice technology in the field of data processing. The specific implementation scheme is as follows: when a first text of a first application is acquired, the first text is added to a queue to be played, and a TTS engine is called, according to the order of the texts in the queue, to convert each text in the queue into voice and play it. The first application is any one of a plurality of voice applications on a terminal device, and the queue to be played stores the texts to be played of the plurality of voice applications. This process allows multiple voice applications to share one TTS engine, avoiding the excessive occupation of memory and CPU resources caused by running multiple TTS engines simultaneously.

Description

Voice playing method, device, equipment and storage medium
Technical Field
The present application relates to voice technologies in the field of data processing, and in particular, to a method, an apparatus, a device, and a storage medium for playing voice.
Background
Various voice applications can be installed in the vehicle-mounted terminal device, for example: a mapping application, a voice assistant application, a music application, a news application, etc.
These voice applications rely on a Text-To-Speech (TTS) engine for voice playback. A TTS engine is a software toolkit that converts text into speech. In the prior art, each voice application integrates its own TTS engine, and each application calls its integrated engine to convert text into voice and play it.
However, with this approach, when multiple voice applications run simultaneously, they occupy an excessive amount of the terminal device's memory and Central Processing Unit (CPU) resources.
Disclosure of Invention
The application provides a voice playing method, a voice playing device, voice playing equipment and a storage medium.
In a first aspect, the present application provides a voice playing method, which is applied to a terminal device, where a TTS engine is deployed in a system service of the terminal device, and the method includes:
when a first text of a first application is acquired, adding the first text into a queue to be played, wherein the first application is any one of a plurality of voice applications of the terminal equipment, and the queue to be played is used for storing the texts to be played of the plurality of voice applications;
and calling the TTS engine to convert each text in the queue to be played into voice according to the text sequence in the queue to be played, and playing the voice.
In a second aspect, the present application provides a speech playing apparatus, which is applied to a terminal device, where a TTS engine is deployed in a system service of the terminal device, and the apparatus includes:
a queue maintenance unit, configured to add a first text of a first application to a queue to be played when the first text is acquired, where the first application is any one of a plurality of voice applications of the terminal device, and the queue to be played is used for storing the texts to be played of the plurality of voice applications;
and the playing processing unit is used for calling the TTS engine to convert each text in the queue to be played into voice according to the text sequence in the queue to be played and playing the voice.
In a third aspect, the present application provides an electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the first aspects.
In a fourth aspect, the present application provides a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any of the first aspects.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the first aspects.
The application provides a voice playing method, a voice playing device, equipment and a storage medium, wherein the method comprises the following steps: when a first text of a first application is acquired, adding the first text into a queue to be played, and calling a TTS engine to convert each text in the queue to be played into voice and play the voice according to the sequence of the texts in the queue to be played, wherein the first application is any one of a plurality of voice applications of a terminal device, and the queue to be played is used for storing the texts to be played of the plurality of voice applications. The above process enables multiple speech applications to share a TTS engine. Therefore, under the condition that a plurality of voice applications run simultaneously, only one TTS engine of the terminal equipment runs, and the problem that memory resources and CPU resources are excessively occupied due to the fact that a plurality of TTS engines run simultaneously is solved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a schematic diagram of a software architecture of a terminal device in the prior art;
fig. 2 is a schematic diagram of a software architecture of a terminal device provided in the present application;
fig. 3 is a schematic flowchart of a voice playing method provided in the present application;
fig. 4 is a schematic diagram of a queue to be played according to the present application;
FIG. 5 is a schematic diagram of another queue to be played provided herein;
fig. 6 is a schematic flowchart of another voice playing method provided in the present application;
FIG. 7 is a schematic diagram of a voice playback process provided herein;
fig. 8A is a schematic structural diagram of a voice playing apparatus provided in the present application;
fig. 8B is a schematic structural diagram of another voice playing apparatus provided in the present application;
fig. 9 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
The application provides a voice playing method, a voice playing device, voice playing equipment and a storage medium, which are applied to a voice technology in the field of data processing so as to reduce the occupation of memory resources and CPU resources of terminal equipment.
Various voice applications can be installed in the vehicle-mounted terminal device, for example: a mapping application, a voice assistant application, a music application, a news application, etc. These Speech applications are based on a Text-To-Speech (TTS) engine for Speech playback. The TTS engine is a software toolkit for converting text to speech.
In the prior art, there is a need to integrate a TTS engine in every speech application. Fig. 1 is a schematic diagram of a software architecture of a terminal device in the prior art. As shown in fig. 1, the software architecture of the terminal device may include an operating system layer, a system service layer, and an application layer. The operating system layer is used for running an operating system. The system service layer is arranged between the operating system layer and the application layer and used for providing a series of system services for the application layer. The application layer is used to deploy multiple applications, such as voice applications.
Referring to fig. 1, each voice application of the terminal device integrates a TTS engine. When each voice application needs to play voice, a TTS engine integrated with the voice application is called to convert the text to be played into voice and play the voice.
However, in this arrangement, when multiple voice applications run simultaneously, multiple TTS engines run simultaneously. On one hand, this occupies an excessive amount of the terminal device's memory and Central Processing Unit (CPU) resources, which can easily cause the terminal device to lag. On the other hand, the large number of TTS engines in the terminal device makes them costly to maintain.
In order to solve at least one of the above technical problems, the present application improves a software architecture of a terminal device. Fig. 2 is a schematic diagram of a software architecture of a terminal device provided in the present application. As shown in fig. 2, the TTS engine is integrated into the system service layer, and the respective speech applications in the application layer do not need to be integrated with the TTS engine. When the voice application needs to play voice, a TTS engine integrated in a system service layer may be called, that is, a plurality of voice applications may share the TTS engine. Therefore, even under the condition that a plurality of voice applications run simultaneously, only one TTS engine (namely, the TTS engine integrated by the system service layer) is operated by the terminal equipment, so that the problem that the memory resource and the CPU resource are excessively occupied due to the simultaneous operation of a plurality of TTS engines is avoided, and the maintenance cost of the TTS engine is also reduced.
Based on the software architecture shown in fig. 2, the application provides a voice playing method, when a first text of any one of a plurality of voice applications of a terminal device is obtained, the first text is added into a queue to be played, and a TTS engine integrated with a system service is called to convert each text in the queue to be played into voice and play the voice according to a text sequence in the queue to be played. Through the process, the multiple voice applications share one TTS engine without integrating TTS in each voice application, so that the occupation of memory resources and CPU resources of the terminal equipment is reduced, and the maintenance cost of the TTS engine is reduced.
The technical solution of the present application will be described in detail with reference to several specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 3 is a schematic flow chart of a voice playing method provided in the present application. As shown in fig. 3, the method of the present embodiment includes:
s301: when a first text of a first application is acquired, adding the first text into a queue to be played, wherein the first application is any one of a plurality of voice applications of a terminal device, and the queue to be played is used for storing the texts to be played of the plurality of voice applications.
The method of the present embodiment may be performed by a terminal device. The terminal equipment plays voice based on a TTS engine. For example, the terminal device can be a vehicle-mounted terminal device or a common terminal device.
A plurality of voice applications are installed in the terminal device, including but not limited to: a mapping application, a voice assistant application, a music application, a news application, etc. The terminal device may adopt the software architecture shown in fig. 2, that is, a TTS engine is deployed in the system service of the terminal device, and a TTS engine does not need to be deployed in each voice application of the terminal device.
In this embodiment, a queue to be played is maintained in the terminal device, where the queue to be played is used to store texts to be played for a plurality of voice applications of the terminal device.
In step S301, the first application is any one of a plurality of voice applications of the terminal device. The first text is a text to be played of the first application. And when the terminal equipment acquires the first text of the first application, adding the first text into the queue to be played. In other words, when the terminal device acquires a text to be played of any one of the plurality of voice applications, the text to be played is added to the queue to be played.
It should be understood that the texts corresponding to different voice applications may differ in form or content. For example, the text corresponding to a news application may be a news item, the text corresponding to a map application may be a navigation sentence, and the text corresponding to a voice assistant application may be a sentence that interacts with the user.
When any voice application generates a text to be played, the terminal equipment acquires the text to be played from the voice application. The way of generating the text to be played by different speech applications may also be different, and this embodiment does not limit this.
For example, some voice applications generate text to be played periodically, or on an event trigger, after being started. For example, after the user starts the news application, the news application may continuously generate news texts to be played, i.e., after each news text finishes playing, the next news text is generated. For another example, while the user navigates a route with the map application, the map application generates the text of a navigation sentence whenever the vehicle reaches a specified position on the map.
For example, some other voice applications require the user to manually trigger the generation of text to be played. For instance, after the user launches the voice assistant application, text to be played is generated only when the user interacts with the application.
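As a concrete illustration of S301, the shared queue can be sketched as below. The `PlayQueue` class, the application names, and the append-to-tail default are illustrative assumptions for this sketch, not the patent's actual implementation (later sections discuss other insertion positions).

```python
from collections import deque

class PlayQueue:
    """Hypothetical shared to-be-played queue serving all voice applications."""

    def __init__(self):
        self._items = deque()  # each item: (app_name, text)

    def add_text(self, app_name, text):
        # S301: a text from any voice application lands in the one shared queue;
        # the insertion position is a policy choice (here: tail of the queue).
        self._items.append((app_name, text))

    def snapshot(self):
        return list(self._items)

q = PlayQueue()
q.add_text("news", "Today's headlines ...")
q.add_text("map", "Turn left in 200 meters")
print(q.snapshot())
```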
S302: and calling a TTS engine to convert each text in the queue to be played into voice according to the text sequence in the queue to be played, and playing the voice.
It can be understood that the TTS engine in this embodiment is a TTS engine integrated in the system service of the terminal device. And according to the sequence of the texts in the queue to be played, sequentially taking out the texts, sending the texts to a TTS engine for processing, converting the texts into voice by the TTS engine, and playing the voice. It should be noted that, the technology for converting the text into the speech and playing the speech by the TTS engine may adopt the prior art, which is not described herein again.
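S302 can be sketched as a consumer loop that drains the queue in order. The `tts_engine` function here is a stand-in stub, since the patent defers to existing TTS technology for the actual text-to-speech conversion.

```python
from collections import deque

def tts_engine(text):
    # Stand-in for the single TTS engine deployed in the system service;
    # a real engine would return audio, here we just tag the text.
    return f"<speech:{text}>"

def play_all(queue):
    """Drain the to-be-played queue in order: convert each text, then 'play' it."""
    played = []
    while queue:
        text = queue.popleft()    # S302: strict queue order
        audio = tts_engine(text)  # one shared engine serves every application
        played.append(audio)      # a real device would send this to the speaker
    return played

q = deque(["text 1", "text 2", "text 3"])
print(play_all(q))  # processed in queue order
```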
The scheme of the present application is illustrated below with reference to fig. 4.
Fig. 4 is a schematic diagram of a queue to be played according to the present application. As shown in fig. 4, it is assumed that the queue to be played contains 3 texts, which are, in order from the head of the queue to the tail: text 1, text 2, and text 3. These 3 texts may come from the same voice application or from different voice applications.
At this time, if the terminal device obtains text x, it adds text x to the queue to be played. Note that the terminal device may add text x at any position in the queue. As shown in fig. 4, the terminal device may add text x according to any one of scheme 1, scheme 2, scheme 3, or scheme 4. The TTS engine then processes each text in turn, according to the updated order of the texts in the queue.
On the basis of the above embodiment, when the terminal device obtains the first text of the first application, if the queue to be played further includes at least one second text, the terminal device determines the playing sequence of the first text and the at least one second text, and adds the first text to the queue to be played according to the playing sequence. The second text refers to the text already existing in the queue to be played.
For example, referring to fig. 4, when the text x is obtained, the text 1, the text 2, and the text 3 already exist in the queue to be played. In this case, the playing order between the text x and the text 1, the text 2, and the text 3 may be determined first. And then, adding the text x into a queue to be played according to the determined playing sequence. For example, if it is determined that the text x is to be played after the text 1 and before the text 2, the text x is added to the queue to be played by using the scheme 2 in fig. 4.
The determining of the playing sequence between the first text and the at least one second text may be implemented in various ways, which are described below in combination with several possible ways.
Mode 1: determining that the playing order is such that the first text precedes the at least one second text. That is, the playing order is opposite to the text acquisition order. The newly acquired text will be played preferentially. This approach requires adding the first text to the head of line position of the queue to be played.
With reference to the example shown in fig. 4, each time the first text is obtained, the first text is added to the queue to be played according to scheme 1. Since the latest acquired text is usually time-efficient, the scheme 1 can ensure the timeliness of the voice playing.
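Mode 1 amounts to a head-of-queue insertion, which might be sketched as follows (using Python's `deque`; the function name is hypothetical):

```python
from collections import deque

def insert_newest_first(queue, text):
    # Mode 1: the newly acquired text goes to the head of the queue,
    # so the most recently acquired (most time-sensitive) text plays first.
    queue.appendleft(text)

q = deque(["text 1", "text 2", "text 3"])
insert_newest_first(q, "text x")
print(list(q))  # ['text x', 'text 1', 'text 2', 'text 3']
```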
Mode 2: and determining the playing sequence of the first text and the at least one second text according to the playing priority of the application corresponding to the first application and the at least one second application.
In this mode, the playing priorities of the voice applications in the terminal device can be determined in advance. For example, the playing priority, from high to low, can be: a map application, a voice assistant application, a news application, a music application, etc. The application corresponding to each second text in the queue to be played is also recorded.
With reference to the example shown in fig. 4, assuming that text 1 in the queue to be played corresponds to a map application, and text 2 and text 3 correspond to music applications, when text x of a news application is acquired, since the playing priority of the news application is higher than that of the music application and lower than that of the map application, the playing order is determined as follows: text 1, text x, text 2, text 3. Further, the text x may be added to the queue to be played according to scheme 2 in fig. 4.
In the method, the playing sequence of the texts is determined according to the playing priority of the application, so that the texts of important applications can be played preferentially.
Optionally, in practical applications, a settings entry may be provided so that users can define the playing priority of each voice application, meeting the personalized requirements of different users.
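Mode 2 can be sketched as a priority-ordered insertion. The priority table below mirrors the example ordering in the text (map > assistant > news > music), but the numeric values and function name are illustrative assumptions:

```python
from collections import deque

# Hypothetical priority table; the patent also allows users to customize it.
PRIORITY = {"map": 0, "assistant": 1, "news": 2, "music": 3}  # lower = sooner

def insert_by_priority(queue, app, text):
    """Mode 2: place (app, text) after all entries of equal or higher priority."""
    pos = len(queue)  # default: append at the tail
    for i, (existing_app, _) in enumerate(queue):
        if PRIORITY[existing_app] > PRIORITY[app]:
            pos = i  # first strictly lower-priority entry: insert just before it
            break
    queue.insert(pos, (app, text))

# Mirrors the fig. 4 example: text 1 is from the map app, texts 2-3 from music.
q = deque([("map", "text 1"), ("music", "text 2"), ("music", "text 3")])
insert_by_priority(q, "news", "text x")
print(list(q))  # text x lands after the map text, before the music texts
```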
Mode 3: and determining the playing sequence of the first text and the at least one second text according to the types of the first text and the at least one second text.
The type of a text refers to the type of information it describes, and texts can be classified in many ways. For example, based on the timeliness of the information, texts can be classified into real-time and non-real-time types, where real-time texts need to be played before non-real-time texts. Texts can also be divided into important and ordinary types according to the importance of the information, where important texts need to be played before ordinary texts.
With reference to the example shown in fig. 4, assuming that the text 1 in the queue to be played is a real-time type, and the texts 2 and 3 are non-real-time types, when the text x of the real-time type is obtained, the playing sequence determined according to the type of each text is as follows: text 1, text x, text 2, text 3. Further, the text x may be added to the queue to be played according to scheme 2 in fig. 4.
In this mode, the playing order of the texts is determined according to their types, so that texts of certain specific types (for example, highly time-sensitive or highly important texts) can be played preferentially.
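Mode 3, using the real-time/non-real-time classification as an example, might look like the sketch below. The two-valued type flag is an assumption for illustration; the patent equally allows an important/ordinary classification.

```python
from collections import deque

def insert_by_type(queue, text, is_realtime):
    """Mode 3: real-time text is queued ahead of all non-real-time text."""
    if is_realtime:
        pos = len(queue)  # default: after every queued entry
        for i, (_, existing_realtime) in enumerate(queue):
            if not existing_realtime:
                pos = i  # first non-real-time entry: insert just before it
                break
        queue.insert(pos, (text, is_realtime))
    else:
        queue.append((text, is_realtime))  # non-real-time text waits its turn

# Mirrors the fig. 4 example: text 1 is real-time, texts 2-3 are not.
q = deque([("text 1", True), ("text 2", False), ("text 3", False)])
insert_by_type(q, "text x", True)
print(list(q))  # text x lands after text 1, before texts 2 and 3
```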
The voice playing method provided by the embodiment comprises the following steps: when the text of any one of the voice applications of the terminal equipment is acquired, the text is added into the queue to be played, and a TTS engine is called to convert each text in the queue to be played into voice and play the voice according to the sequence of the texts in the queue to be played. The above process enables multiple speech applications to share a TTS engine. Therefore, under the condition that a plurality of voice applications run simultaneously, only one TTS engine runs on the terminal equipment, the problem that the memory resource and the CPU resource are excessively occupied due to the fact that a plurality of TTS engines run simultaneously is solved, and the maintenance cost of the TTS engines is reduced.
In the above embodiment, texts are sent to the TTS engine one whole text at a time, in the order of the texts in the queue to be played. In this manner, referring to the example shown in fig. 4, as long as none of text 1, text 2, and text 3 in the queue has yet been processed by the TTS engine, text x can be cut in at any chosen point, i.e., inserted at any position in the queue.
However, if text x is acquired while the TTS engine is processing text 1, then because the TTS engine processes text in whole-text units, it can only process text x after it finishes text 1. Therefore, under the above embodiment, even if the determined playing order places text x before text 1, text x cannot be cut in ahead of text 1; that is, the TTS engine cannot pause playing text 1, play text x, and then resume playing text 1 after text x finishes.
In order to solve the above problem, on the basis of the above embodiments, the present application further provides another voice playing method. This is described below in conjunction with fig. 5 to 7.
Fig. 5 is a schematic diagram of another queue to be played provided in the present application. As shown in fig. 5, in this embodiment, the queue to be played stores the segments corresponding to the texts to be played of a plurality of voice applications. The queue in fig. 5 includes: segments 1 to 4 of text 1, segments 1 and 2 of text 2, and segments 1 and 2 of text 3. Because the queue stores data in units of segments, the TTS engine can process each segment in sequence, segment by segment. In this way, even if text x is acquired while the TTS engine is processing text 1, text x can be inserted at an arbitrary position, using the method shown in fig. 6.
Fig. 6 is a schematic flowchart of another speech playing method provided in the present application. As shown in fig. 6, the method of the present embodiment includes:
s601: when the first text of the first application is acquired, the playing sequence of the first text and each text in the queue to be played is determined.
It should be understood that the manner of determining the playing sequence in S601 may refer to manner 1 to manner 3 in the above embodiments, which is not described herein again.
S602: if the playing sequence is that the first text precedes the third text, dividing the first text into a plurality of segments, and inserting the segments in front of the unplayed segment of the third text in the queue to be played, wherein the third text is a text currently processed by a TTS engine.
The first text may be divided into a plurality of segments in a plurality of dividing manners. Two possible examples are given below.
In one possible implementation, the first text is divided into a plurality of segments in units of a preset number of characters. For example, every 10 characters are divided into a segment.
In another possible implementation, specific separators in the first text are detected, and the first text is divided into a plurality of segments according to these separators. The specific separators include, but are not limited to: punctuation marks, conjunctions, modal particles, etc.
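The two division manners can be sketched as follows. The chunk size of 10 follows the example in the text, while the separator set (sentence punctuation, both ASCII and Chinese) is an illustrative assumption:

```python
import re

def split_by_length(text, n=10):
    # First manner: fixed-size chunks of n characters each.
    return [text[i:i + n] for i in range(0, len(text), n)]

def split_by_separators(text):
    # Second manner: split on detected separators; the separator set here
    # (sentence punctuation) is an illustrative assumption.
    parts = re.split(r"[,.;!?，。；！？]", text)
    return [p.strip() for p in parts if p.strip()]

print(split_by_length("abcdefghijklmnopqrstuv", 10))
print(split_by_separators("Turn left ahead. Then keep straight, for two kilometers."))
```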
S603: and calling a TTS engine to convert each segment in the queue to be played into voice according to the sequence of the segments in the queue to be played, and playing the voice.
In connection with the example shown in fig. 5, assume that the terminal device acquires text x (the first text) while the TTS engine is processing segment 1 of text 1 (i.e., the third text is text 1). If the determined playing order places text x before text 1, text x may be divided into a plurality of segments, which are inserted before segment 2 of text 1.
Thus, the TTS engine processes the segments in their order in the queue: after finishing segment 1 of text 1, it processes the segments of text x, and after finishing those, it continues with segments 2, 3, and 4 of text 1. In effect, text x is cut in before the remainder of text 1: the TTS engine pauses playing text 1, plays text x, and resumes playing text 1 after text x finishes.
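The insertion step of S602 can be sketched as below, with the queue holding `(text_id, segment)` pairs; the helper name and this representation are assumptions for illustration:

```python
from collections import deque

def interrupt_with(queue, current_text_id, new_segments):
    """Insert new_segments before the unplayed segments of the text currently
    being processed, so the current text pauses and later resumes (S602)."""
    # Everything still queued for the current text is unplayed, so the new
    # text's segments go just before its first remaining segment.
    pos = 0  # if no segment of the current text remains, insert at the head
    for i, (text_id, _) in enumerate(queue):
        if text_id == current_text_id:
            pos = i
            break
    for offset, seg in enumerate(new_segments):
        queue.insert(pos + offset, seg)

# Segment 1 of text 1 is being processed (already taken off the queue).
q = deque([("text1", "seg2"), ("text1", "seg3"), ("text1", "seg4")])
interrupt_with(q, "text1", [("textx", "seg1"), ("textx", "seg2")])
print(list(q))  # text x's segments now precede text 1's remaining segments
```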
The following is an example of a specific application scenario. It is assumed that the user starts a news application and a map application in the in-vehicle terminal device while driving the vehicle. The user listens to news through the news application while navigating the route through the map application. The following describes the voice playing process of the application scenario with reference to fig. 7.
Fig. 7 is a schematic diagram of a voice playing process provided in the present application. As shown in fig. 7, the terminal device acquires a news text to be played from the news application, divides it into a plurality of segments, and adds the segments to the queue to be played. It then invokes the TTS engine to process each segment in turn, in the order of the segments in the queue. Referring to fig. 7, assume the TTS engine has finished segment 1 of the news text and is currently processing segment 2.
At this time, the terminal device acquires a map text to be played (e.g., the text of a navigation sentence) from the map application. Because the playing priority of the map application is higher than that of the news application, the terminal device determines that the map text should be played first, divides it into a plurality of segments (two segments are illustrated in fig. 7), and inserts the map-text segments before the unprocessed segments of the news text (i.e., before segment 3 of the news text).
Thus, after processing segment 2 of the news text, the TTS engine processes segments 1 and 2 of the map text; once both map-text segments are done, it continues with segment 3 of the news text and the segments that follow. In other words, the terminal device temporarily interrupts the news, plays the map navigation sentence, and then resumes the news from the point of interruption.
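The priority decision in this scenario can be sketched as follows (the priority values, app names, and function signature are illustrative assumptions, not part of the patent):

```python
# Hypothetical play priorities: a higher value is played first.
PRIORITY = {"map": 2, "news": 1}

def enqueue(queue, next_index, app, segments):
    """Add a new text's segments to the queue of (app, segment) pairs.
    If the new app outranks the app whose segments are next in line,
    its segments jump ahead of the unplayed segments; otherwise they
    are appended at the tail."""
    current_app = queue[next_index][0] if next_index < len(queue) else None
    if current_app is None or PRIORITY[app] > PRIORITY.get(current_app, 0):
        queue[next_index:next_index] = [(app, s) for s in segments]
    else:
        queue += [(app, s) for s in segments]
    return queue
```

With news segments 1 and 2 already played (`next_index = 2`), enqueuing two map segments places them ahead of news segment 3, reproducing the interruption in fig. 7.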
In this embodiment, by storing the segments corresponding to each text in the queue to be played, the terminal device can, on acquiring a new text, insert the new text's segments before the unplayed segments of the text currently being processed. The TTS engine thereby interrupts the current text, processes the new text first, and resumes the previous text once the new text is done.
Fig. 8A is a schematic structural diagram of a voice playing apparatus provided in the present application. The apparatus of the present embodiment may be in the form of software and/or hardware. The device can be arranged in the terminal equipment, and a TTS engine is deployed in the system service of the terminal equipment. As shown in fig. 8A, the voice playing apparatus 800 provided in this embodiment includes: a queue maintenance unit 801 and a play processing unit 802.
The queue maintenance unit 801 is configured to, when a first text of a first application is acquired, add the first text into a queue to be played, where the first application is any one of a plurality of voice applications of the terminal device, and the queue to be played is used to store the texts to be played of the plurality of voice applications;
the playing processing unit 802 is configured to invoke the TTS engine to convert each text in the queue to be played into a voice according to the text sequence in the queue to be played, and play the voice.
Fig. 8B is a schematic structural diagram of another voice playing apparatus provided in the present application. In the voice playing apparatus 800 of this embodiment, on the basis of fig. 8A, the queue maintenance unit 801 includes: a determining module 8011 and an inserting module 8012.
The determining module 8011 is configured to, when a first text of a first application is acquired, determine a playing sequence of the first text and at least one second text if the queue to be played further includes the at least one second text;
the inserting module 8012 is configured to add the first text to the queue to be played according to the playing order.
In a possible implementation manner, the determining module 8011 is specifically configured to:
determining that the playback order is that the first text precedes the at least one second text.
In a possible implementation manner, the determining module 8011 is specifically configured to:
determine the playing sequence of the first text and the at least one second text according to the playing priorities of the applications to which the first text and the at least one second text respectively belong.
In a possible implementation manner, the determining module 8011 is specifically configured to:
determine the playing sequence of the first text and the at least one second text according to the types of the first text and the at least one second text.
In a possible implementation manner, the queue to be played is used to store segments corresponding to texts to be played of the plurality of voice applications; the at least one second text comprises a third text, and the third text is a text currently processed by the TTS engine;
if the playing sequence is that the first text precedes the third text, the inserting module 8012 is specifically configured to: dividing the first text into a plurality of segments, and inserting the plurality of segments in front of the unplayed segment of the third text in the queue to be played;
the play processing unit is specifically configured to: invoke the TTS engine to convert each segment in the queue to be played into voice according to the order of the segments in the queue to be played, and play the voice.
In a possible implementation manner, the inserting module 8012 is specifically configured to:
dividing the first text into a plurality of segments by taking a preset number of characters as a unit;
or,
detecting a specific separator in the first text, and dividing the first text into a plurality of segments according to the specific separator.
The voice playing apparatus provided in this embodiment may be configured to execute the voice playing method in any of the above method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
According to an embodiment of the present application, the present application further provides an electronic device. The electronic device may be a vehicle-mounted terminal device or an ordinary user terminal device. The electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor.
The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the voice playing method executed by the terminal device in any of the above embodiments. The implementation principle and the technical effect are similar, and the detailed description is omitted here.
According to an embodiment of the present application, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the voice playing method in any one of the above embodiments. The implementation principle and the technical effect are similar, and the detailed description is omitted here.
According to an embodiment of the present application, there is also provided a computer program product, including a computer program, which when executed by a processor implements the voice playing method in any of the above embodiments. The implementation principle and the technical effect are similar, and the detailed description is omitted here.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 9, the electronic device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device can also be stored. The computing unit 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the respective methods and processes described above, such as the voice playing method. For example, in some embodiments, the voice playing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the voice playing method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the voice playing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; no limitation is imposed herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (17)

1. A voice playing method is applied to terminal equipment, a TTS engine is deployed in system service of the terminal equipment, and the method comprises the following steps:
when a first text of a first application is acquired, adding the first text into a queue to be played, wherein the first application is any one of a plurality of voice applications of the terminal equipment, and the queue to be played is used for storing the texts to be played of the plurality of voice applications;
and calling the TTS engine to convert each text in the queue to be played into voice according to the text sequence in the queue to be played, and playing the voice.
2. The method according to claim 1, wherein adding the first text to a queue to be played when the first text of the first application is acquired comprises:
when a first text of a first application is acquired, if the queue to be played further comprises at least one second text, determining a playing sequence of the first text and the at least one second text, and adding the first text to the queue to be played according to the playing sequence.
3. The method of claim 2, wherein determining the order of play of the first text and the at least one second text comprises:
determining that the playback order is that the first text precedes the at least one second text.
4. The method of claim 2, wherein determining the order of play of the first text and the at least one second text comprises:
and determining the playing sequence of the first text and the at least one second text according to the playing priority of the application corresponding to the first application and the at least one second text respectively.
5. The method of claim 2, wherein determining the order of play of the first text and the at least one second text comprises:
and determining the playing sequence of the first text and the at least one second text according to the types of the first text and the at least one second text.
6. The method according to any one of claims 3 to 5, wherein the queue to be played is used for storing segments corresponding to texts to be played of the plurality of voice applications; the at least one second text comprises a third text, and the third text is a text currently processed by the TTS engine;
if the playing sequence is that the first text is prior to the third text, adding the first text to the queue to be played according to the playing sequence, including:
dividing the first text into a plurality of segments, and inserting the plurality of segments in front of the unplayed segment of the third text in the queue to be played;
the invoking the TTS engine to convert each text in the queue to be played into voice according to the text sequence in the queue to be played, and playing the voice, comprises:
and calling the TTS engine to convert each segment in the queue to be played into voice according to the sequence of the segments in the queue to be played, and playing the voice.
7. The method of claim 6, wherein dividing the first text into a plurality of segments comprises:
dividing the first text into a plurality of segments by taking a preset number of characters as a unit;
or,
detecting a specific separator in the first text, and dividing the first text into a plurality of segments according to the specific separator.
8. A voice playing apparatus, applied to a terminal device, wherein a TTS engine is deployed in a system service of the terminal device, the apparatus comprising:
the device comprises a queue maintenance unit, a queue management unit and a display unit, wherein the queue maintenance unit is used for adding a first text of a first application into a queue to be played when the first text of the first application is acquired, the first application is any one of a plurality of voice applications of the terminal equipment, and the queue to be played is used for storing the texts to be played of the plurality of voice applications;
and the playing processing unit is used for calling the TTS engine to convert each text in the queue to be played into voice according to the text sequence in the queue to be played and playing the voice.
9. The apparatus of claim 8, wherein the queue maintenance unit comprises: a determining module and an inserting module;
the determining module is configured to determine, when a first text of a first application is acquired, a playing sequence of the first text and at least one second text if the queue to be played further includes the at least one second text;
and the inserting module is used for adding the first text into the queue to be played according to the playing sequence.
10. The apparatus of claim 9, wherein the determining module is specifically configured to:
determining that the playback order is that the first text precedes the at least one second text.
11. The apparatus of claim 9, wherein the determining module is specifically configured to:
and determining the playing sequence of the first text and the at least one second text according to the playing priority of the application corresponding to the first application and the at least one second text respectively.
12. The apparatus of claim 9, wherein the determining module is specifically configured to:
and determining the playing sequence of the first text and the at least one second text according to the types of the first text and the at least one second text.
13. The apparatus according to any one of claims 10 to 12, wherein the queue to be played is configured to store segments corresponding to texts to be played for the plurality of voice applications; the at least one second text comprises a third text, and the third text is a text currently processed by the TTS engine;
if the playing sequence is that the first text precedes the third text, the inserting module is specifically configured to: dividing the first text into a plurality of segments, and inserting the plurality of segments in front of the unplayed segment of the third text in the queue to be played;
the play processing unit is specifically configured to: and calling the TTS engine to convert each segment in the queue to be played into voice according to the sequence of the segments in the queue to be played, and playing the voice.
14. The apparatus of claim 13, wherein the inserting module is specifically configured to:
dividing the first text into a plurality of segments by taking a preset number of characters as a unit;
or,
detecting a specific separator in the first text, and dividing the first text into a plurality of segments according to the specific separator.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1 to 7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 7.
CN202011511791.6A 2020-12-18 2020-12-18 Voice playing method, device, equipment and storage medium Pending CN112527235A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011511791.6A CN112527235A (en) 2020-12-18 2020-12-18 Voice playing method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN112527235A true CN112527235A (en) 2021-03-19

Family

ID=75001711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011511791.6A Pending CN112527235A (en) 2020-12-18 2020-12-18 Voice playing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112527235A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246020A (en) * 2008-03-14 2008-08-20 凯立德欣技术(深圳)有限公司 Voice broadcasting device and navigation system using the same and its method
CN105530171A (en) * 2015-12-23 2016-04-27 腾讯科技(深圳)有限公司 Method and apparatus for playing instant message sound on vehicle-mounted terminal
CN108257590A (en) * 2018-01-05 2018-07-06 携程旅游信息技术(上海)有限公司 Voice interactive method, device, electronic equipment, storage medium
CN109756616A (en) * 2017-11-02 2019-05-14 腾讯科技(深圳)有限公司 The treating method and apparatus of message, storage medium, electronic device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
抹香君: "Android TTS Series 1: How to give an app TTS capability" [in Chinese], pages 1 - 5, Retrieved from the Internet <URL: Jianshu, https://www.jianshu.com/p/fd1329547051> *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449141A (en) * 2021-06-08 2021-09-28 阿波罗智联(北京)科技有限公司 Voice broadcasting method and device, electronic equipment and storage medium
CN116884390A (en) * 2023-09-06 2023-10-13 四川蜀天信息技术有限公司 Method and device for improving user interaction fluency
CN116884390B (en) * 2023-09-06 2024-01-26 四川蜀天信息技术有限公司 Method and device for improving user interaction fluency


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211022

Address after: 100176 101, floor 1, building 1, yard 7, Ruihe West 2nd Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing

Applicant after: Apollo Zhilian (Beijing) Technology Co.,Ltd.

Address before: 2 / F, baidu building, 10 Shangdi 10th Street, Haidian District, Beijing 100085

Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.