CN111667815B - Method, apparatus, chip circuit and medium for text-to-speech conversion - Google Patents

Publication number
CN111667815B
CN111667815B (application CN202010498289.XA)
Authority
CN
China
Prior art keywords: text, audio data, converted, stored, unique identifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number: CN202010498289.XA
Other languages: Chinese (zh)
Other versions: CN111667815A (en)
Inventors: 封宣阳, 蔡海蛟, 冯歆鹏, 周骥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NextVPU Shanghai Co Ltd
Original Assignee
NextVPU Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by NextVPU Shanghai Co Ltd filed Critical NextVPU Shanghai Co Ltd
Priority to CN202010498289.XA priority Critical patent/CN111667815B/en
Publication of CN111667815A publication Critical patent/CN111667815A/en
Application granted granted Critical
Publication of CN111667815B publication Critical patent/CN111667815B/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The application provides a method, a device, a computer-readable storage medium, and a chip circuit for text-to-speech conversion. The method comprises the following steps: receiving a text to be converted; generating a unique identifier for the text to be converted; determining whether the unique identifier is stored; and, if the unique identifier is determined to be stored, acquiring the audio data corresponding to the unique identifier as the output audio of the text to be converted.

Description

Method, apparatus, chip circuit and medium for text-to-speech conversion
Technical Field
The present application relates to the field of speech synthesis, and more particularly to a method for text-to-speech conversion, a device implementing such a method, a chip circuit and a computer readable storage medium.
Background
Speech recognition and speech synthesis are the two key technologies required to implement human-machine speech communication. Speech synthesis is the technique of generating artificial speech by mechanical or electronic means. TTS (Text To Speech) is an important speech synthesis technique that converts an input text file into natural-language speech output. TTS technology is currently applied in many fields, such as voice navigation, audio books, online translation, and online education, where it converts input or built-in text content into audio data and plays that audio. A typical TTS process is as follows: text content is input to a TTS engine, the engine converts the text content into audio data, and the audio data is then played through a speaker. The TTS engine repeats this process indiscriminately, even when it later encounters the same text content. In many current application scenarios, audio output is only an auxiliary output mode: it is used infrequently and its timeliness requirements are loose, so the processing burden caused by the repetition is acceptable.
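The conventional flow described above can be sketched as follows. Here `synthesize` is a hypothetical placeholder standing in for a real TTS engine, not any particular vendor's API:

```python
def synthesize(text: str) -> bytes:
    """Placeholder for a real TTS engine; returns fake audio bytes."""
    return ("<audio:" + text + ">").encode("utf-8")

def naive_tts_play(text: str) -> bytes:
    audio = synthesize(text)  # the engine always runs the full conversion
    # ... play `audio` through a speaker ...
    return audio

# The same text triggers a complete re-synthesis every time it appears:
first = naive_tts_play("turn left in 100 meters")
second = naive_tts_play("turn left in 100 meters")
assert first == second  # identical output, but the work was done twice
```

It is exactly this duplicated synthesis work that becomes a problem in conversion-heavy scenarios.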
For some applications, however, such as those dedicated to helping vision-impaired people read, text-to-speech conversion is performed frequently. If the above processing manner is still used, the processing load and power consumption of the device increase greatly, resources are wasted, and real-time performance is difficult to guarantee.
Disclosure of Invention
To address these problems, the present application provides a text-to-speech scheme that accelerates conventional text-to-speech conversion and lightens the processing load of the speech synthesis module, making it well suited to application scenarios with heavy text-to-speech workloads and strict timeliness requirements.
According to one aspect of the present application, a method for text-to-speech conversion is provided. The method comprises the following steps: receiving a text to be converted; generating a unique identifier for the text to be converted; determining whether the unique identifier is stored; and if the unique identifier is determined to be stored, acquiring audio data corresponding to the unique identifier as output audio of the text to be converted.
According to another aspect of the present application, there is provided an apparatus for text-to-speech conversion. The apparatus includes: a memory having computer program code stored thereon; and a processor configured to run the computer program code to perform the method as described above.
According to yet another aspect of the present application, a computer-readable storage medium is provided. The computer readable storage medium has stored thereon computer program code which, when executed, performs the method as described above.
According to yet another aspect of the present application, there is provided a chip circuit comprising a circuit unit configured to perform the method as described above upon power-up.
Drawings
FIG. 1 illustrates a flow chart of a method for text-to-speech conversion according to some embodiments of the application;
FIG. 2 illustrates a flow chart of a method for text-to-speech conversion according to further embodiments of the application;
FIG. 3 illustrates a flow chart of a method for text-to-speech conversion according to still other embodiments of the present application; and
FIG. 4 shows a schematic block diagram of an example device that may be used to implement an embodiment of the application.
Detailed Description
The following detailed description of various embodiments of the present application will be provided in connection with the accompanying drawings to provide a clearer understanding of the objects, features and advantages of the present application. It should be understood that the embodiments shown in the drawings are not intended to limit the scope of the application, but rather are merely illustrative of the true spirit of the application.
In the following description, for the purposes of explanation of various inventive embodiments, certain specific details are set forth in order to provide a thorough understanding of the various inventive embodiments. One skilled in the relevant art will recognize, however, that an embodiment may be practiced without one or more of the specific details. In other instances, well-known devices, structures, and techniques associated with the present application may not be shown or described in detail to avoid unnecessarily obscuring the description of the embodiments.
Throughout the specification and claims, unless the context requires otherwise, the word "comprise" and variations such as "comprises" and "comprising" will be understood to be open-ended and inclusive, i.e., to be interpreted as meaning "including, but not limited to".
Reference throughout this specification to "one embodiment" or "some embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases "in one embodiment" or "in some embodiments" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the terms first, second and the like in the description and in the claims, are used for descriptive purposes only and not for limiting the size or other order of the objects described.
Fig. 1 illustrates a flow chart of a method 100 for text-to-speech conversion according to some embodiments of the application. The method 100 may be performed, for example, by the device 400 described below, a schematic block diagram of which is shown in fig. 4.
As shown in fig. 1, the method 100 includes a step 110 of receiving text to be converted. Here, the text to be converted may be received from the outside through various I/O interfaces (such as the I/O interface 450 described below). For example, the text to be converted may be input by a user through an input device such as a keyboard or mouse. Alternatively, the text to be converted may be text generated after a user captures an image through an input device such as a camera and image recognition (e.g., Optical Character Recognition (OCR)) is performed on the image. In some other embodiments, the text to be converted may also be predetermined text preset in the device 400, such as text for voice prompts or navigation.
Next, at step 120, a unique Identifier (ID) is generated for the text to be converted. The unique identifier is used to uniquely identify the text to be converted.
In some embodiments, the unique identifier may include a field indicating the content of the text to be converted. For example, the field may contain an encoding of the text to be converted itself, or some transformation of that encoding (e.g., a hash value).
In other embodiments, the unique identifier may also include other information about the text to be converted. For example, in addition to the first field indicating the content of the text to be converted as described above, the unique identifier may include a second field indicating at least one of: the type of speech synthesis module (e.g., TTS engine) used by the device 400, the user's preferred speech rate, and the user's preferred timbre. The speech synthesis module may be any type of speech synthesis module currently existing or developed in the future, and its type may include the model (manufacturer), version, and so on of the module. The speech synthesis module may be implemented, for example, by the CPU 410 in the device 400 running a computer program for speech synthesis; alternatively, it may be a separate dedicated chip. As examples of speech synthesis modules, many vendors (e.g., Google, Baidu, and others) have developed TTS engines, and any of these may serve as the speech synthesis module described herein. The user's preferred speech rate may be set by the user at initialization or set automatically by the device 400 based on the user's previous behavior, such as the user's speech rate when making voice input. For example, the speech rate may be set in several levels, such as fast, medium, and slow, or as a specific rate, such as 60-80 words/min. The user's preferred timbre may likewise be set by the user at initialization or automatically by the device 400 according to the user's previous behavior. For example, the timbre may be male, female, child, etc. Alternatively, the timbre may be customized by the user, such as the voice of the user or of a specific other person.
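One plausible way to build such a two-field identifier is to hash the text content for the first field and append the synthesis settings as the second field. This is a sketch under assumptions, not the patent's implementation: the field layout, the separators, and the default setting values shown here are all illustrative:

```python
import hashlib

def make_unique_id(text: str, engine: str = "engineA",
                   rate: str = "medium", timbre: str = "female") -> str:
    # First field: a digest of the text content itself.
    content_field = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Second field: the synthesis settings (engine type, speech rate, timbre).
    config_field = "|".join([engine, rate, timbre])
    return content_field + ":" + config_field

# Identical text with identical settings always yields the same identifier;
# changing only a setting changes the second field but not the first.
assert make_unique_id("hello") == make_unique_id("hello")
assert make_unique_id("hello", rate="fast") != make_unique_id("hello")
```

Any deterministic digest would do; SHA-256 is used here only because it makes collisions between different texts practically impossible.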
Next, at step 130, a determination is made as to whether the unique identifier is stored. In one embodiment, a database is utilized in device 400 to store unique identifiers corresponding to different text. Here, the database may be a database integrated in the device 400 as described below (e.g., in ROM 420, RAM 430, storage unit 480, or other flash memory) or in a separate storage device independent of the device 400. In one embodiment, to ensure real-time output audio, the database is provided in a form that facilitates quick access by the processor of the device 400, such as in a cache.
If it is determined that the unique identifier is stored (yes in step 130), then in step 140 the audio data corresponding to the unique identifier is acquired as the output audio of the text to be converted, and may be played to the user at step 170. In one embodiment, the audio data corresponding to the unique identifiers of different texts, together with related information for that audio data, may be stored in a cache of the device 400. Here, the cache may be located in the RAM 430 (e.g., NVRAM) or the storage unit 480 of the device 400 described below, or in a flash memory, disk, or hard disk separate from the device 400.
In some embodiments, the information associated with the unique identifier may include the starting position, in the buffer, of the audio data corresponding to the unique identifier, and the length of that audio data.
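A toy model of this buffer-plus-database arrangement may clarify the split. The names and structure are illustrative: an in-memory `bytearray` stands in for the audio buffer, and a dictionary stands in for the database:

```python
class AudioStore:
    """Raw audio bytes live in one flat buffer; the database maps a
    unique identifier to the (start, length) of its audio data."""

    def __init__(self) -> None:
        self.buffer = bytearray()  # stands in for the audio buffer/cache
        self.db = {}               # unique_id -> (start, length)

    def store(self, uid: str, audio: bytes) -> None:
        start = len(self.buffer)
        self.buffer.extend(audio)  # append audio at the end of the buffer
        self.db[uid] = (start, len(audio))

    def fetch(self, uid: str):
        entry = self.db.get(uid)
        if entry is None:
            return None            # identifier not stored
        start, length = entry
        return bytes(self.buffer[start:start + length])
```

Storing only (start, length) in the database keeps the lookup structure small, while the bulk audio bytes stay in one contiguous region.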
On the other hand, if it is determined that the unique identifier is not stored (no in step 130), then in step 150 the text to be converted may be converted into audio data (e.g., by inputting it into a speech synthesis module). The audio data may be played at step 170 as the output audio corresponding to the text to be converted.
In addition, between steps 150 and 170, the method 100 may further include a step 160, in which the audio data obtained in step 150 is stored along with related information about that audio data (such as its starting position in the buffer and its length). In one embodiment, the audio data resulting from step 150 may be stored in the buffer and its related information stored in the database, so that both are available to subsequent text-to-speech conversions.
Note that although step 160 is shown in fig. 1 as being located between steps 150 and 170, those skilled in the art will appreciate that step 160 may also be performed after step 170 or in parallel with step 170 without departing from the scope of the present disclosure.
With the scheme shown in fig. 1, for text to be converted whose corresponding audio data has already been stored in the device 400, the method 100 can acquire that audio data directly, without performing speech synthesis, thereby greatly improving the real-time performance of audio data acquisition. Text to be converted whose corresponding audio data is not yet stored in the device 400 is still converted by the speech synthesis module, so the conversion quality is preserved.
Fig. 2 illustrates a flow chart of a method 200 for text-to-speech conversion in accordance with further embodiments of the present application. The method 200 may be performed, for example, by the apparatus 400 as described below.
In the embodiment shown in fig. 2, steps 210, 220, 230, 240, 250, 260, and 270 are similar to steps 110, 120, 130, 140, 150, 160, and 170 in the embodiment shown in fig. 1. In the embodiment shown in fig. 2, in addition to the first field indicating the content of the text to be converted, the unique identifier of the text to be converted may include a second field indicating at least one of the type of speech synthesis module used by the device 400, the user's preferred speech rate, and the user's preferred timbre.
Unlike the embodiment shown in fig. 1, in the method 200, if it is determined after step 230 that the unique identifier of the text to be converted is not stored (no in step 230), then in step 235 it is determined whether a second identifier matching the first field of the unique identifier is stored. In one embodiment, the first field of the matched second identifier is identical to the first field of the unique identifier of the text to be converted. In another embodiment, the first field of the matched second identifier is semantically identical or similar to the first field of the unique identifier of the text to be converted. In this case, in step 235, the text to be converted may be semantically analyzed and a second identifier having the same or similar semantics found in the database. Alternatively, the database may store in advance a table of correspondences between texts with identical or similar semantics; in that case, in step 235, a second identifier whose semantics are identical or similar to those of the text to be converted may be looked up in the correspondence table.
That is, step 235 looks for the unique identifier (i.e., a second identifier) of a text that is identical or similar in content to the text to be converted but differs in the other information (e.g., the type of speech synthesis module used, or the user's preferred speech rate and timbre). If such a second identifier is present, audio data for a text whose content matches that of the text to be converted, though the other information differs, has already been stored in the buffer.
In this case, if it is determined that the second identifier is stored (yes in step 235), then in step 245 the audio data corresponding to the second identifier may be retrieved as the output audio (e.g., based on the information about the second identifier stored in the database) and played at step 270 as the audio corresponding to the text to be converted. That is, the output audio conveys the content of the text to be converted but does not fully conform to the user's preferences (type of speech synthesis module, preferred speech rate and timbre, etc.).
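Assuming, for illustration, identifiers of the form `<content-field>:<settings-field>` (a format chosen only for this sketch, not prescribed by the method), the exact-match variant of the step-235 lookup could be:

```python
def find_second_identifier(db: dict, uid: str):
    """Return stored audio whose identifier shares the first (content)
    field with `uid`, even if the second (settings) field differs."""
    content_field = uid.split(":", 1)[0]
    for stored_uid, audio in db.items():
        if stored_uid.split(":", 1)[0] == content_field:
            return audio  # content matches; settings may differ
    return None           # no matching second identifier stored
```

A linear scan is shown for clarity; a real store could index the database by the content field so that this lookup is a single hash access.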
On the other hand, if it is determined that the second identifier is not stored (no in step 235), the method 200 may proceed to step 250 to convert the text to be converted into audio data as output audio, similar to that in step 150.
With the scheme shown in fig. 2, when no audio data corresponding to the unique identifier of the text to be converted is stored, but audio data for identical or similar text content is stored, the efficiency of the overall text-to-speech conversion can be improved at the cost of only partially meeting the user's preferences, thereby improving the user experience.
Fig. 3 illustrates a flow chart of a method 300 for text-to-speech conversion according to further embodiments of the application. The method 300 may be performed, for example, by the apparatus 400 as described below.
In the embodiment shown in fig. 3, steps 310, 320, 330, 340, 350, 360, and 370 are similar to steps 110, 120, 130, 140, 150, 160, and 170 in the embodiment shown in fig. 1 and steps 210, 220, 230, 240, 250, 260, and 270 in the embodiment shown in fig. 2.
Unlike the embodiments shown in figs. 1 and 2, the method 300 may further include, after converting the text to be converted into audio data in step 350, a step 355 in which it is determined whether the conversion of step 350 was successful. If the conversion of step 350 is determined to be successful (yes in step 355), step 370 is executed to play the converted audio data as the output audio. Conversely, if the conversion of step 350 is determined to be unsuccessful, step 380 may be performed to obtain preset specific audio data (e.g., from the buffer) as the output audio and play it in step 370. Here, the preset specific audio data may indicate to the user that the conversion was unsuccessful and/or guide the user to perform a specific operation. For example, the specific audio data may be a fault-handling guidance voice or a customer-service guidance voice preset in the device 400 for guiding the user to troubleshoot on their own or to seek customer-service assistance.
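A minimal sketch of this success check and fallback follows. The `synthesize` callable and the fallback message are placeholders, and treating an empty result or a raised exception as a failed conversion is an assumption of this sketch:

```python
FALLBACK_AUDIO = b"<audio:conversion failed, please contact customer service>"

def convert_with_fallback(text: str, synthesize) -> bytes:
    try:
        audio = synthesize(text)   # step 350: attempt the conversion
        if audio:                  # step 355: did the conversion succeed?
            return audio           # step 370 would play this audio
    except Exception:
        pass                       # a raised error also counts as failure
    return FALLBACK_AUDIO          # step 380: preset guidance audio
```

Because the fallback audio is preset, the user always hears something meaningful even when the synthesis module fails entirely.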
The methods 100 through 300 illustrated in figs. 1 through 3 are examples of aspects of embodiments of the present application; those skilled in the art will appreciate that they do not limit the scope of the present application and may be combined or modified in various ways. For example, in one combination of the embodiments of figs. 2 and 3, steps 235 and 245 as shown in fig. 2 may be performed between steps 330 and 350. In another combination of the embodiments of figs. 2 and 3, steps 235 and 245 as shown in fig. 2 may be performed in place of step 380 when the determination in step 355 is no.
Fig. 4 shows a schematic block diagram of an example device 400 that may be used to implement an embodiment of the application. The device 400 may be, for example, a desktop or laptop computer or another electronic device for text-to-speech conversion. As shown, the device 400 may include one or more central processing units (CPUs) 410 (only one is shown schematically) that can perform various appropriate actions and processes (such as running a TTS engine for TTS conversion) according to computer program instructions stored in a read-only memory (ROM) 420 or loaded from a storage unit 480 into a random access memory (RAM) 430. The RAM 430 may also store various programs and data required for the operation of the device 400. The CPU 410, the ROM 420, and the RAM 430 are connected to each other by a bus 440. An input/output (I/O) interface 450 is also connected to the bus 440.
Various components in device 400 are connected to I/O interface 450, including: an input unit 460 such as a keyboard, a mouse, etc.; an output unit 470 such as various types of displays, speakers, and the like; a storage unit 480 such as a magnetic disk, an optical disk, or the like; and a communication unit 490, such as a network card, modem, wireless communication transceiver, etc. The communication unit 490 allows the device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The methods 100 to 300 described above may be performed, for example, by the processing unit 410 of the device 400. For example, in some embodiments, methods 100-300 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 480. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 400 via ROM 420 and/or communication unit 490. When the computer program is loaded into RAM 430 and executed by CPU 410, one or more operations of methods 100 through 300 described above may be performed. In addition, the communication unit 490 may support wired or wireless communication functions.
Methods 100 through 300 for text-to-speech conversion and apparatus 400 according to the present application are described above with reference to the accompanying drawings. Those skilled in the art will appreciate, however, that device 400 need not include all of the components shown in fig. 4, but may include only some of the components necessary to perform the functions described herein, and that the manner in which these components are connected is not limited to the form shown in the figures. For example, where the device 400 is a portable device such as a cell phone, the device 400 may have a different structure than in fig. 4.
The present application may be embodied as methods, apparatus, chip circuits and/or computer program products. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing various aspects of the present application. The chip circuitry may include circuit elements for performing various aspects of the application.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., a light pulse through a fiber optic cable), or an electrical signal transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for carrying out operations of the present application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the internet using an internet service provider). In some embodiments, aspects of the present application are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of the computer readable program instructions, the electronic circuitry executing the computer readable program instructions.
Various aspects of the present application are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
According to some embodiments of the present application, a method for text-to-speech conversion is provided. The method comprises the following steps: receiving a text to be converted; generating a unique identifier for the text to be converted; determining whether the unique identifier is stored; and if the unique identifier is determined to be stored, acquiring the audio data corresponding to the unique identifier as the output audio of the text to be converted.
According to some embodiments of the application, wherein determining whether the unique identifier is stored comprises: determining whether the unique identifier is stored in a database, and wherein obtaining audio data corresponding to the unique identifier as output audio of the text to be converted comprises: and acquiring audio data corresponding to the unique identifier from a cache based on the related information of the unique identifier stored in the database as output audio of the text to be converted.
According to some embodiments of the application, the method further comprises: if the unique identifier is not stored, converting the text to be converted into audio data as the output audio; and storing the audio data and related information of the audio data.
According to some embodiments of the application, converting the text to be converted into audio data as the output audio comprises: inputting the text to be converted into a speech synthesis module, which converts it into the audio data serving as the output audio.
According to some embodiments of the application, the speech synthesis module comprises a TTS engine.
According to some embodiments of the application, storing the audio data and the related information of the audio data comprises: storing the audio data in a cache and storing the related information of the audio data in a database.
According to some embodiments of the application, the unique identifier comprises: a first field indicating the content of the text to be converted, and a second field indicating at least one of a type of a speech synthesis module, a user-preferred speech rate, and a timbre.
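One possible way to build the two-field identifier described above: the first field fingerprints the text content, and the second field encodes the synthesis settings (engine type, preferred speech rate, timbre). The patent does not specify the encoding; the SHA-256 hash, the `|` joining, and the `#` separator are illustrative choices.

```python
import hashlib

def make_identifier(text: str, engine: str, rate: float, timbre: str) -> str:
    """Build a two-field identifier: content fingerprint + synthesis settings."""
    first = hashlib.sha256(text.encode("utf-8")).hexdigest()  # first field: text content
    second = f"{engine}|{rate}|{timbre}"                      # second field: settings
    return f"{first}#{second}"

# The same text under different settings yields different identifiers,
# while the first field (before '#') stays the same.
a = make_identifier("hello", "tts-a", 1.0, "female")
b = make_identifier("tts", "tts-b", 1.5, "male")
```

Separating the two fields is what makes the later fallback possible: a lookup can still match on text content alone when no entry with the exact settings exists.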
According to some embodiments of the application, the method further comprises: if it is determined that the unique identifier is not stored, determining whether a second identifier matching the first field of the unique identifier is stored; and if it is determined that the second identifier is stored, retrieving the audio data corresponding to the second identifier as the output audio.
According to some embodiments of the application, the method further comprises: if it is determined that the second identifier is not stored, converting the text to be converted into audio data as the output audio.
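The two-stage lookup in the preceding paragraphs can be sketched as one function: exact match first, then a match on the first field only, and finally synthesis on a full miss. Here the store is assumed to map identifiers of the form `"<first-field>#<second-field>"` to audio bytes; the names and the `#` separator are illustrative assumptions, not details from the patent.

```python
def resolve_audio(identifier: str, store: dict, synthesize) -> bytes:
    """Resolve audio for `identifier`: exact hit, first-field hit, or synthesize."""
    if identifier in store:
        return store[identifier]              # exact match on the unique identifier
    first_field = identifier.split("#", 1)[0]
    for stored_id, audio in store.items():
        if stored_id.split("#", 1)[0] == first_field:
            return audio                      # same text, different synthesis settings
    return synthesize()                       # full miss: run the speech synthesis module
```

The first-field fallback trades exactness for latency: already-synthesized audio of the same text is reused even when the stored settings differ, avoiding a fresh TTS run.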
According to some embodiments of the application, the method further comprises: determining whether the text to be converted is successfully converted into audio data; if it is determined that the text to be converted is successfully converted into the audio data, playing the output audio; and if it is determined that the text to be converted is not successfully converted into the audio data, acquiring preset specific audio data as the output audio, wherein the specific audio data indicates that the conversion is unsuccessful and/or guides a user to perform a specific operation.
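The success check with a preset fallback clip can be illustrated as below. This is a sketch under the assumption that the synthesis engine either returns audio bytes or returns nothing/raises on failure; `convert_with_fallback` and its parameters are hypothetical names.

```python
def convert_with_fallback(text: str, engine, fallback_audio: bytes):
    """Try to synthesize `text`; on failure, return the preset fallback audio."""
    try:
        audio = engine(text)
    except Exception:
        audio = None
    if audio:
        return audio, True            # conversion succeeded: play the synthesized output
    return fallback_audio, False      # preset clip, e.g. "conversion failed, please retry"
```

Returning a success flag alongside the audio lets the caller both play something immediately and decide whether to re-queue the text for a later conversion attempt.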
According to some embodiments of the application, the related information comprises: a starting position of the audio data in the cache and a length of the audio data.
There is also provided, in accordance with some embodiments of the present application, an apparatus for text-to-speech conversion. The apparatus includes: a memory having computer program code stored thereon; and a processor configured to run the computer program code to perform the method as described above.
According to some embodiments of the present application, there is also provided a computer readable storage medium having stored thereon computer program code which, when executed, performs a method as described above.
According to some embodiments of the application, there is also provided a chip circuit comprising a circuit unit configured to perform, upon power-up, the method as described above.
The foregoing description of embodiments of the application has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method for text-to-speech conversion, comprising:
receiving a text to be converted;
generating a unique identifier for the text to be converted, wherein the unique identifier comprises:
a first field indicating the content of the text to be converted, and
a second field indicating at least one of a type of a speech synthesis module, a user-preferred speech rate, and a timbre;
determining whether the unique identifier is stored;
if the unique identifier is determined to be stored, acquiring audio data corresponding to the unique identifier as output audio of the text to be converted;
if it is determined that the unique identifier is not stored, determining whether a second identifier matching the first field of the unique identifier is stored;
if it is determined that the second identifier is stored, retrieving the audio data corresponding to the second identifier as the output audio;
if the second identifier is not stored, converting the text to be converted into audio data serving as the output audio; and
storing the audio data and related information of the audio data.
2. The method of claim 1, wherein determining whether the unique identifier is stored comprises:
determining whether the unique identifier is stored in a database, and wherein
obtaining the audio data corresponding to the unique identifier as the output audio of the text to be converted comprises:
acquiring, from a cache, the audio data corresponding to the unique identifier as the output audio of the text to be converted, based on the related information of the unique identifier stored in the database.
3. The method of claim 1, wherein converting the text to be converted into audio data as the output audio comprises:
the text to be converted is input into a speech synthesis module to be converted into audio data as the output audio.
4. The method of claim 3, wherein the speech synthesis module comprises a TTS engine.
5. The method of claim 1, wherein storing the audio data and related information for the audio data comprises:
the audio data is stored in a buffer and relevant information of the audio data is stored in a database.
6. The method of claim 1, further comprising:
determining whether the text to be converted is successfully converted into audio data;
if it is determined that the text to be converted is successfully converted into the audio data, playing the output audio; and
if it is determined that the text to be converted is not successfully converted into the audio data, acquiring preset specific audio data as the output audio, wherein the specific audio data indicates that the conversion is unsuccessful and/or guides a user to perform a specific operation.
7. The method of claim 2 or 5, wherein the related information comprises: a starting position of the audio data in the cache and a length of the audio data.
8. An apparatus for text-to-speech conversion, comprising:
a memory having computer program code stored thereon; and
a processor configured to run the computer program code to perform the method of any of claims 1 to 7.
9. A computer readable storage medium having stored thereon computer program code which, when executed, performs the method of any of claims 1 to 7.
10. A chip circuit comprising a circuit unit configured to perform the method of any of claims 1 to 7 upon power up.
CN202010498289.XA 2020-06-04 2020-06-04 Method, apparatus, chip circuit and medium for text-to-speech conversion Active CN111667815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010498289.XA CN111667815B (en) 2020-06-04 2020-06-04 Method, apparatus, chip circuit and medium for text-to-speech conversion

Publications (2)

Publication Number Publication Date
CN111667815A (en) 2020-09-15
CN111667815B (en) 2023-09-01

Family

ID=72385997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010498289.XA Active CN111667815B (en) 2020-06-04 2020-06-04 Method, apparatus, chip circuit and medium for text-to-speech conversion

Country Status (1)

Country Link
CN (1) CN111667815B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1788305A (en) * 2003-06-19 2006-06-14 国际商业机器公司 System and method for configuring voice readers using semantic analysis
CN101354840A (en) * 2008-09-08 2009-01-28 众智瑞德科技(北京)有限公司 Method and apparatus for performing voice reading control of electronic book
US9646601B1 (en) * 2013-07-26 2017-05-09 Amazon Technologies, Inc. Reduced latency text-to-speech system
CN107480159A (en) * 2016-12-02 2017-12-15 广东小天才科技有限公司 The input method and device of a kind of speech data
CN109658917A (en) * 2019-01-17 2019-04-19 深圳壹账通智能科技有限公司 E-book chants method, apparatus, computer equipment and storage medium
CN110503991A (en) * 2019-08-07 2019-11-26 Oppo广东移动通信有限公司 Voice broadcast method, device, electronic equipment and storage medium
CN110853616A (en) * 2019-10-22 2020-02-28 武汉水象电子科技有限公司 Speech synthesis method, system and storage medium based on neural network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7653542B2 (en) * 2004-05-26 2010-01-26 Verizon Business Global Llc Method and system for providing synthesized speech
US8751562B2 (en) * 2009-04-24 2014-06-10 Voxx International Corporation Systems and methods for pre-rendering an audio representation of textual content for subsequent playback
US9240180B2 (en) * 2011-12-01 2016-01-19 At&T Intellectual Property I, L.P. System and method for low-latency web-based text-to-speech without plugins
US9378727B2 (en) * 2013-04-27 2016-06-28 Tencent Technology (Shenzhen) Company Limited Method and apparatus for audio playing

Also Published As

Publication number Publication date
CN111667815A (en) 2020-09-15

Similar Documents

Publication Publication Date Title
CN111667816B (en) Model training method, speech synthesis method, device, equipment and storage medium
CN107016994B (en) Voice recognition method and device
CN111369971B (en) Speech synthesis method, device, storage medium and electronic equipment
US20190096402A1 (en) Method and apparatus for extracting information
JP2020505643A (en) Voice recognition method, electronic device, and computer storage medium
US11586689B2 (en) Electronic apparatus and controlling method thereof
CN111489735B (en) Voice recognition model training method and device
CN104573099A (en) Topic searching method and device
CN109616096A (en) Construction method, device, server and the medium of multilingual tone decoding figure
KR102140391B1 (en) Search method and electronic device using the method
US20240029709A1 (en) Voice generation method and apparatus, device, and computer readable medium
KR20210060897A (en) Method and apparatus for processing speech
CN112259089A (en) Voice recognition method and device
JP2023027749A (en) Method and apparatus for determining broadcasting style, equipment, and computer storage medium
CN112133285B (en) Speech recognition method, device, storage medium and electronic equipment
CN113053362A (en) Method, device, equipment and computer readable medium for speech recognition
CN113012683A (en) Speech recognition method and device, equipment and computer readable storage medium
CN111667815B (en) Method, apparatus, chip circuit and medium for text-to-speech conversion
KR102621436B1 (en) Voice synthesizing method, device, electronic equipment and storage medium
CN111783433A (en) Text retrieval error correction method and device
CN112836476B (en) Summary generation method, device, equipment and medium
CN111353035B (en) Man-machine conversation method and device, readable storage medium and electronic equipment
US10726211B1 (en) Automated system for dynamically generating comprehensible linguistic constituents
CN112820280A (en) Generation method and device of regular language model
JPWO2009041220A1 (en) Abbreviation generation apparatus and program, and abbreviation generation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant