CN111667815A - Method, apparatus, chip circuit and medium for text-to-speech conversion


Info

Publication number
CN111667815A
Authority
CN
China
Prior art keywords
text
audio data
converted
unique identifier
stored
Prior art date
Legal status
Granted
Application number
CN202010498289.XA
Other languages
Chinese (zh)
Other versions
CN111667815B (en)
Inventor
封宣阳
蔡海蛟
冯歆鹏
周骥
Current Assignee
NextVPU Shanghai Co Ltd
Original Assignee
NextVPU Shanghai Co Ltd
Priority date
Filing date
Publication date
Application filed by NextVPU Shanghai Co Ltd
Priority to CN202010498289.XA
Publication of CN111667815A
Application granted
Publication of CN111667815B
Status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Telephone Function (AREA)

Abstract

The invention provides a method, an apparatus, a computer-readable storage medium and a chip circuit for text-to-speech conversion. The method comprises the following steps: receiving a text to be converted; generating a unique identifier for the text to be converted; determining whether the unique identifier is stored; and, if it is determined that the unique identifier is stored, acquiring audio data corresponding to the unique identifier as output audio of the text to be converted.

Description

Method, apparatus, chip circuit and medium for text-to-speech conversion
Technical Field
The present invention relates to the field of speech synthesis, and more particularly, to a method for text-to-speech conversion, an apparatus implementing such a method, a chip circuit, and a computer-readable storage medium.
Background
Speech recognition and speech synthesis are the two key technologies required to implement human-computer speech communication. Speech synthesis is a technique for generating artificial speech by mechanical or electronic means. TTS (Text To Speech) is an important speech synthesis technology that converts an input text into natural-language speech output. TTS technology is currently widely used in fields such as voice navigation, audio books, online translation, and online education to convert input or built-in text content into audio data and play it. A typical TTS process is as follows: the text content is input to a TTS engine, the TTS engine converts the text content into audio data, and the audio data is then played through a speaker. The TTS engine repeats this process without distinction even when the same text content is encountered again later. In many current application scenarios the audio output is only an auxiliary output mode, is not used frequently, and has no strict timeliness requirement, so the processing burden caused by this repetition is acceptable.
However, some applications, such as applications dedicated to helping visually impaired people read, require frequent text-to-speech conversion. If the above processing method is still used, the processing load and power consumption of the device increase greatly, resources are wasted, and real-time performance is difficult to guarantee.
Disclosure of Invention
In view of the above problems, the present invention provides a scheme for text-to-speech conversion that accelerates conventional text-to-speech conversion and reduces the processing load of the speech synthesis module, making it well suited to application scenarios with heavy text-to-speech processing demands and strict timeliness requirements.
According to one aspect of the invention, a method for text-to-speech conversion is provided. The method comprises the following steps: receiving a text to be converted; generating a unique identifier for the text to be converted; determining whether the unique identifier is stored; and, if it is determined that the unique identifier is stored, acquiring audio data corresponding to the unique identifier as output audio of the text to be converted.
According to another aspect of the invention, an apparatus for text-to-speech conversion is provided. The apparatus comprises: a memory having computer program code stored thereon; and a processor configured to execute the computer program code to perform the method as described above.
According to yet another aspect of the present invention, a computer-readable storage medium is provided. The computer readable storage medium has stored thereon a computer program code which, when executed, performs the method as described above.
According to a further aspect of the invention, there is provided a chip circuit comprising circuit elements configured to perform the method as described above at power-up.
Drawings
FIG. 1 illustrates a flow diagram of a method for text-to-speech conversion according to some embodiments of the invention;
FIG. 2 illustrates a flow diagram of a method for text-to-speech conversion according to further embodiments of the present invention;
FIG. 3 illustrates a flow diagram of a method for text-to-speech conversion in accordance with further embodiments of the present invention; and
FIG. 4 shows a schematic block diagram of an example device that may be used to implement an embodiment of the invention.
Detailed Description
Embodiments of the present invention are described in detail below with reference to the accompanying drawings so that the objects, features and advantages of the present invention can be understood more clearly. It should be understood that the embodiments shown in the drawings are not intended to limit the scope of the present invention, but merely to illustrate the spirit of its technical solution.
In the following description, for the purposes of illustrating various inventive embodiments, certain specific details are set forth in order to provide a thorough understanding of the various inventive embodiments. One skilled in the relevant art will recognize, however, that the embodiments may be practiced without one or more of the specific details. In other instances, well-known devices, structures and techniques associated with this application may not be shown or described in detail to avoid unnecessarily obscuring the description of the embodiments.
Throughout the specification and claims, the word "comprise" and variations thereof, such as "comprises" and "comprising," are to be understood in an open, inclusive sense, i.e., as meaning "including, but not limited to," unless the context requires otherwise.
Reference throughout this specification to "one embodiment" or "some embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least one embodiment. Thus, the appearances of the phrases "in one embodiment" or "in some embodiments" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the terms "first", "second" and the like used in the description and the claims serve only to distinguish objects for clarity and do not limit the size, order or other properties of the described objects.
FIG. 1 illustrates a flow diagram of a method 100 for text-to-speech conversion according to some embodiments of the invention. The method 100 may be performed, for example, by the apparatus 400 as described below. Referring to fig. 4, fig. 4 shows a schematic block diagram of an example device 400 that may be used to implement an embodiment of the invention.
As shown in fig. 1, method 100 includes step 110 of receiving text to be converted. Here, the text to be converted may be received from the outside through various I/O interfaces (such as the I/O interface 450 described below). For example, the text to be converted may be input by a user through an input device such as a keyboard or a mouse. Alternatively, the text to be converted may be text obtained by capturing an image through an input device such as a camera and performing image recognition (e.g., Optical Character Recognition (OCR)) on it. In some other embodiments, the text to be converted may also be predetermined text preset in the device 400, such as text for voice prompts or navigation.
Next, at step 120, a unique Identifier (ID) is generated for the text to be converted. The unique identifier is used to uniquely identify the text to be converted.
In some embodiments, the unique identifier may include a field indicating the content of the text to be converted. For example, the field may contain an encoding of the text to be converted itself, or some variation of that encoding (e.g., a hash value).
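For illustration only, the following is a minimal sketch of such a field, assuming a SHA-256 hash of the normalized text; the patent does not prescribe any particular encoding or hash function, so the choices below are assumptions.

```python
import hashlib

def text_fingerprint(text: str) -> str:
    """Return a fixed-length field derived from the content of the text to be converted.

    Hypothetical sketch: whitespace is normalized and the UTF-8 bytes are hashed.
    SHA-256 is an assumption; any encoding of the text, or a variation of that
    encoding, would serve the same purpose.
    """
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Identical text content always yields the same field value.
assert text_fingerprint("Turn left ahead") == text_fingerprint("Turn  left ahead")
```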
In other embodiments, the unique identifier may also include other information about the text to be converted. For example, in addition to the first field indicating the content of the text to be converted as described above, the unique identifier may include a second field indicating at least one of the type of speech synthesis module (e.g., TTS engine) used by device 400, the user's preferred speech rate, and the user's preferred timbre. Here, the speech synthesis module may be any type of speech synthesis module currently existing or developed in the future, and its type may include the model (manufacturer), version, and the like of the speech synthesis module. The speech synthesis module may be implemented, for example, by CPU 410 in device 400 running a computer program for speech synthesis. Alternatively, the speech synthesis module may be a separate dedicated chip or the like. As an example of a speech synthesis module, various TTS engines have been developed by various manufacturers (e.g., Google, iFLYTEK, Baidu, etc.) and can be used as the speech synthesis module described herein. The user's preferred speech rate may be set by the user at initialization or set automatically by the device 400 based on the user's previous behavioral habits, such as the speech rate at which the user makes speech input. For example, the speech rate may be set to one of several levels, such as fast, medium and slow, or to a specific speed, such as 60-80 words/minute. The user's preferred timbre may likewise be set by the user at initialization or set automatically by the device 400 based on the user's previous behavioral habits. For example, the timbre may include a male voice, a female voice, a child voice, and the like. Alternatively, the timbre may be a user-customized timbre, such as the user's own voice or the voice of a particular other person.
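As a sketch of how the two fields might be combined, the snippet below builds on the text_fingerprint() helper from the previous example; the class name, field names, separator, and preference values are illustrative assumptions rather than part of the claimed identifier format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TtsIdentifier:
    """Hypothetical two-field identifier for a text-to-speech request."""
    content_field: str   # first field: fingerprint of the text content
    profile_field: str   # second field: speech synthesis module type, speech rate, timbre

def make_identifier(text: str, engine: str, rate: str, timbre: str) -> TtsIdentifier:
    # The second field simply concatenates the engine/user preferences (assumed layout).
    return TtsIdentifier(
        content_field=text_fingerprint(text),
        profile_field=f"{engine}|{rate}|{timbre}",
    )

# The same text with a different preferred timbre yields a different identifier,
# but the content fields still match (useful for the fallback in method 200 below).
a = make_identifier("Turn left ahead", "engineX-v1", "medium", "female")
b = make_identifier("Turn left ahead", "engineX-v1", "medium", "male")
assert a != b and a.content_field == b.content_field
```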
Next, at step 130, it is determined whether the unique identifier is stored. In one embodiment, a database in device 400 is used to store the unique identifiers corresponding to different texts. Here, the database may be a database integrated within device 400 as described below (e.g., located in ROM 420, RAM 430, the storage unit 480, or other flash memory) or a storage device separate from device 400. In one embodiment, to ensure real-time output audio, the database is provided in a form that is readily accessible to the processor of the device 400, such as in a cache.
If it is determined that the unique identifier is stored (a "yes" determination in step 130), audio data corresponding to the unique identifier is acquired in step 140 as the output audio corresponding to the text to be converted, and the output audio may be played to the user at step 170. In one embodiment, audio data corresponding to the unique identifiers of different texts, together with related information for that audio data, may be stored in a cache of device 400. Here, the cache may be located in RAM 430 (e.g., NVRAM) of the device 400 as described below, in the storage unit 480, or in a flash memory, magnetic disk, or hard disk separate from the device 400.
In some embodiments, the information related to the unique identifier may include the starting position, in the buffer, of the audio data corresponding to the unique identifier, and the length of that audio data.
On the other hand, if it is determined that the unique identifier is not stored (a "no" determination in step 130), the text to be converted may be converted into audio data in step 150 (e.g., by inputting the text to be converted into a speech synthesis module). The audio data may then be played at step 170 as the output audio corresponding to the text to be converted.
In addition, between steps 150 and 170, the method 100 may further include step 160, in which the audio data obtained in step 150 is stored together with its related information (such as the start position of the audio data in the buffer and the length of the audio data). In one embodiment, the audio data obtained in step 150 may be stored in a buffer and the related information may be stored in a database, so that the audio data and the related information are available the next time the text-to-speech conversion process is executed.
Note that while step 160 is shown in fig. 1 as being between steps 150 and 170, those skilled in the art will appreciate that step 160 may also be performed after step 170 or in parallel with step 170 without departing from the scope of the present disclosure.
With the scheme shown in fig. 1, for text to be converted whose corresponding audio data has already been stored in the device 400, the method 100 can acquire that audio data directly without performing speech synthesis, thereby greatly improving the real-time performance of audio data acquisition. Text to be converted whose corresponding audio data has not yet been stored in the device 400 can still be converted by the speech synthesis module, thereby ensuring the conversion quality.
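Putting steps 110 through 170 together, here is a simplified sketch of the lookup-or-synthesize flow. The in-memory dictionary standing in for the database, the bytearray standing in for the audio cache, and the synthesize callable standing in for the speech synthesis module are all illustrative assumptions; a real device would use its own storage and TTS engine, and TtsIdentifier is the hypothetical class from the earlier sketch.

```python
from typing import Callable, Dict, Tuple

class TtsCache:
    """Sketch of steps 130-170: reuse stored audio when possible, otherwise synthesize and store.

    Audio bytes are appended to a flat buffer; the "database" maps an identifier to the
    (start position, length) of its audio data, matching the related information of step 160.
    """

    def __init__(self, synthesize: Callable[[str], bytes]):
        self._synthesize = synthesize                          # speech synthesis module
        self._buffer = bytearray()                             # audio data cache
        self._db: Dict[TtsIdentifier, Tuple[int, int]] = {}    # identifier -> (start, length)

    def get_output_audio(self, text: str, ident: TtsIdentifier) -> bytes:
        record = self._db.get(ident)                # step 130: is the identifier stored?
        if record is not None:                      # step 140: acquire cached audio data
            start, length = record
            return bytes(self._buffer[start:start + length])
        audio = self._synthesize(text)              # step 150: convert the text to audio
        start = len(self._buffer)                   # step 160: store audio and (start, length)
        self._buffer.extend(audio)
        self._db[ident] = (start, len(audio))
        return audio                                # step 170: returned for playback
```

On a second call with the same identifier, the synthesis step is skipped entirely, which is the latency and power saving the method targets.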
FIG. 2 shows a flow diagram of a method 200 for text-to-speech conversion according to further embodiments of the present invention. The method 200 may be performed, for example, by the device 400 as described below.
In the embodiment shown in FIG. 2, steps 210, 220, 230, 240, 250, 260, and 270 are similar to steps 110, 120, 130, 140, 150, 160, and 170 in the embodiment shown in FIG. 1. In the embodiment shown in fig. 2, the unique identifier of the text to be converted may include a second field indicating at least one of the type of speech synthesis module used by the apparatus 400, a user-preferred speech rate and a timbre, in addition to the first field indicating the content of the text to be converted.
Unlike the embodiment shown in fig. 1, in the method 200, if it is determined after step 230 that the unique identifier of the text to be converted is not stored (a "no" determination in step 230), it is determined in step 235 whether a second identifier matching the first field of the unique identifier is stored. In one embodiment, the first field of the matching second identifier is identical to the first field of the unique identifier of the text to be converted. In another embodiment, the first field of the matching second identifier is semantically identical or similar to the first field of the unique identifier of the text to be converted. In this case, in step 235, the text to be converted may be semantically analyzed and a second identifier having the same or similar semantics may be found in the database. Alternatively, the database may store in advance a correspondence table of texts having the same or similar semantics, in which case a second identifier having the same or similar semantics as the text to be converted may be looked up in the correspondence table in step 235.
That is, what is sought is the unique identifier (i.e., a second identifier) of text whose content is the same as or similar to that of the text to be converted but whose other information (e.g., the type of speech synthesis module used, the user's preferred speech rate, and timbre) differs. If such a second identifier exists, it indicates that audio data for text whose content is the same as or similar to that of the text to be converted, but whose other information differs, has already been stored in the buffer.
In this case, if it is determined that the second identifier is stored (a "yes" determination in step 235), audio data corresponding to the second identifier may be looked up in step 245 and used as the output audio (for example, based on the related information of the second identifier stored in the database). The output audio may be played at step 270 as the audio data corresponding to the text to be converted. That is, the output audio is audio data that conveys the content of the text to be converted but does not completely match the user's preferences (speech synthesis module type, preferred speech rate, timbre, etc.).
On the other hand, if it is determined that the second identifier is not stored (a "no" determination in step 235), the method 200 may proceed to step 250, where, similarly to step 150, the text to be converted is converted into audio data as the output audio.
With the scheme shown in fig. 2, in the case where audio data corresponding to the unique identifier of the text to be converted is not stored but audio data for text whose content is the same as or similar to that of the text to be converted is stored, the efficiency of the overall text-to-speech conversion can be improved at the cost of not fully satisfying some user preferences, thereby improving the user experience.
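A sketch of the step 235 matching follows, reusing the identifier and database layout assumed in the earlier sketches; it checks only exact equality of the first (content) field, while the semantic-similarity variant would need an additional comparison function that is not shown.

```python
from typing import Dict, Optional, Tuple

def find_matching_second_identifier(
    db: Dict[TtsIdentifier, Tuple[int, int]],
    ident: TtsIdentifier,
) -> Optional[TtsIdentifier]:
    """Step 235: look for a stored identifier whose content field matches the query,
    even though its second field (engine type, speech rate, timbre) differs."""
    for stored in db:
        if stored != ident and stored.content_field == ident.content_field:
            return stored       # step 245 would then fetch this identifier's audio data
    return None
```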
FIG. 3 illustrates a flow diagram of a method 300 for text-to-speech conversion in accordance with further embodiments of the present invention. The method 300 may be performed, for example, by the device 400 as described below.
In the embodiment shown in FIG. 3, steps 310, 320, 330, 340, 350, 360 and 370 are similar to steps 110, 120, 130, 140, 150, 160 and 170 in the embodiment shown in FIG. 1 and steps 210, 220, 230, 240, 250, 260 and 270 in the embodiment shown in FIG. 2.
Unlike the embodiments shown in fig. 1 and 2, in the method 300, after the text to be converted is converted into audio data in step 350, a step 355 may further be included in which it is determined whether the conversion of step 350 was successful. When it is determined that the conversion in step 350 was successful (a "yes" determination in step 355), step 370 is executed to play the converted audio data as the output audio. Conversely, if it is determined that the conversion of step 350 was not successful, step 380 may be performed to retrieve preset specific audio data as the output audio (e.g., from a buffer) and play it in step 370. Here, the preset specific audio data may indicate to the user that the conversion was not successful and/or guide the user to perform a specific operation. For example, the specific audio data may be a troubleshooting guidance voice or a customer service guidance voice preset in the device 400 for guiding the user to troubleshoot the problem or to seek customer service help.
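The success check of steps 355 and 380 might look like the sketch below, where a failed conversion is modeled as an exception or an empty result; FALLBACK_AUDIO is a placeholder for the preset guidance voice, not actual audio data from the patent.

```python
from typing import Callable

# Placeholder for the preset troubleshooting / customer-service guidance audio (assumed).
FALLBACK_AUDIO = b"<preset guidance audio bytes>"

def convert_with_fallback(text: str, synthesize: Callable[[str], bytes]) -> bytes:
    """Steps 350/355/380: try to convert the text; on failure, return the preset audio."""
    try:
        audio = synthesize(text)        # step 350: convert the text to audio data
    except Exception:
        audio = b""
    if audio:                           # step 355: was the conversion successful?
        return audio                    # played as output audio in step 370
    return FALLBACK_AUDIO               # step 380: preset specific audio data
```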
The methods 100 to 300 shown in fig. 1 to 3 exemplarily illustrate some aspects of embodiments of the present invention. Those skilled in the art will understand that these drawings do not limit the scope of the present invention and that the illustrated embodiments can be combined or modified in various ways. For example, in one combination of the embodiments of fig. 2 and 3, steps 235 and 245 as shown in fig. 2 may be performed between steps 330 and 350. In another combination of the embodiments of fig. 2 and 3, steps 235 and 245 shown in fig. 2 may be performed instead of step 380 when the determination in step 355 is negative.
FIG. 4 shows a schematic block diagram of an example device 400 that may be used to implement an embodiment of the invention. The device 400 may be, for example, a desktop or laptop computer or other electronic device for text-to-speech conversion. As shown, device 400 may include one or more Central Processing Units (CPUs) 410 (only one shown schematically) that may perform various appropriate actions and processes, such as a TTS engine for performing TTS conversion, according to computer program instructions stored in a read-only memory (ROM) 420 or computer program instructions loaded from a storage unit 480 into a Random Access Memory (RAM) 430. In the RAM 430, various programs and data required for the operation of the device 400 can also be stored. CPU 410, ROM 420 and RAM 430 are connected to each other via bus 440. An input/output (I/O) interface 450 is also connected to bus 440.
Various components in device 400 are connected to I/O interface 450, including: an input unit 460 such as a keyboard, a mouse, etc.; an output unit 470 such as various types of displays, speakers, and the like; a storage unit 480 such as a magnetic disk, an optical disk, or the like; and a communication unit 490 such as a network card, modem, wireless communication transceiver, etc. The communication unit 490 allows the device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The methods 100 to 300 described above may be performed, for example, by the processing unit 410 of the device 400. For example, in some embodiments, the methods 100-300 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 480. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 400 via ROM 420 and/or communication unit 490. When the computer program is loaded into RAM 430 and executed by CPU 410, one or more operations of methods 100-300 described above may be performed. Further, the communication unit 490 may support wired or wireless communication functions.
The methods 100 to 300 and the device 400 for text-to-speech conversion according to the present invention have been described above with reference to the accompanying drawings. Those skilled in the art will appreciate, however, that the device 400 need not contain all of the components shown in fig. 4; it may contain only the components necessary to perform the functions described in the present invention, and the manner in which these components are connected is not limited to the form shown in the drawings. For example, in the case where the device 400 is a portable device such as a cellular phone, the device 400 may have a structure different from that in fig. 4.
The present invention may be embodied as methods, apparatus, chip circuits, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therein for carrying out aspects of the present invention. The chip circuitry may include circuitry elements for performing various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
According to some embodiments of the present invention, a method for text-to-speech conversion is provided. The method comprises the following steps: receiving a text to be converted; generating a unique identifier for the text to be converted; determining whether the unique identifier is stored; and if the unique identifier is determined to be stored, acquiring audio data corresponding to the unique identifier as output audio of the text to be converted.
According to some embodiments of the invention, determining whether the unique identifier is stored comprises determining whether the unique identifier is stored in a database, and acquiring the audio data corresponding to the unique identifier as the output audio of the text to be converted comprises acquiring the audio data corresponding to the unique identifier from a cache, based on the related information of the unique identifier stored in the database, as the output audio of the text to be converted.
According to some embodiments of the invention, the method further comprises: if the unique identifier is not stored, converting the text to be converted into audio data as the output audio; and storing the audio data and the related information of the audio data.
According to some embodiments of the invention, wherein converting the text to be converted into audio data as the output audio comprises: inputting the text to be converted into a speech synthesis module to convert the text to be converted into audio data as the output audio.
According to some embodiments of the invention, wherein the speech synthesis module comprises a TTS engine.
According to some embodiments of the invention, wherein storing the audio data and the information related to the audio data comprises: storing the audio data in a cache and storing information related to the audio data in a database.
According to some embodiments of the invention, wherein the unique identifier comprises: a first field indicating the content of the text to be converted, and a second field indicating at least one of a type of a speech synthesis module, a user preferred speech rate, and a timbre.
According to some embodiments of the invention, the method further comprises: if it is determined that the unique identifier is not stored, determining whether a second identifier matching the first field of the unique identifier is stored; and if it is determined that the second identifier is saved, searching for audio data corresponding to the second identifier as the output audio.
According to some embodiments of the invention, the method further comprises: if it is determined that the second identifier is not stored, converting the text to be converted into audio data as the output audio.
According to some embodiments of the invention, the method further comprises: determining whether the text to be converted is successfully converted into audio data; if it is determined that the text to be converted is successfully converted into the audio data, playing the output audio; and if it is determined that the text to be converted is not successfully converted into the audio data, acquiring preset specific audio data as the output audio, wherein the specific audio data indicates that the conversion was unsuccessful and/or guides a user to perform a specific operation.
According to some embodiments of the invention, the related information comprises: the starting position of the audio data in the buffer and the length of the audio data.
There is also provided, in accordance with some embodiments of the present invention, an apparatus for text-to-speech conversion. The apparatus comprises: a memory having computer program code stored thereon; and a processor configured to execute the computer program code to perform the method as described above.
There is also provided, in accordance with some embodiments of the present invention, a computer readable storage medium having stored thereon computer program code which, when executed, performs the method as described above.
There is also provided, in accordance with some embodiments of the present invention, a chip circuit, including a circuit unit configured to perform the method as described above at power-up.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method for text-to-speech conversion, comprising:
receiving a text to be converted;
generating a unique identifier for the text to be converted;
determining whether the unique identifier is stored; and
if it is determined that the unique identifier is stored, acquiring audio data corresponding to the unique identifier as the output audio of the text to be converted.
2. The method of claim 1, wherein determining whether the unique identifier is stored comprises:
determining whether the unique identifier is stored in a database, and wherein
Acquiring the audio data corresponding to the unique identifier as the output audio of the text to be converted includes:
acquiring audio data corresponding to the unique identifier from a cache, based on the related information of the unique identifier stored in the database, as the output audio of the text to be converted.
3. The method of claim 1, further comprising:
if the unique identifier is not stored, converting the text to be converted into audio data as the output audio; and
storing the audio data and the related information of the audio data.
4. The method of claim 3, wherein storing the audio data and information related to the audio data comprises:
storing the audio data in a cache and storing information related to the audio data in a database.
5. The method of claim 1, wherein the unique identifier comprises:
a first field indicating the content of the text to be converted, and
a second field indicating at least one of a type of speech synthesis module, a user preferred speech rate, and a timbre.
6. The method of claim 5, further comprising:
if it is determined that the unique identifier is not stored, determining whether a second identifier matching the first field of the unique identifier is stored; and is
If it is determined that the second identifier is saved, audio data corresponding to the second identifier is searched for as the output audio.
7. The method of claim 3 or 6, further comprising:
determining whether the text to be converted is successfully converted into audio data;
if the text to be converted is successfully converted into the audio data, playing the output audio; and
if it is determined that the text to be converted is not successfully converted into the audio data, acquiring preset specific audio data as the output audio, wherein the specific audio data indicates that the conversion was unsuccessful and/or guides a user to perform a specific operation.
8. An apparatus for text-to-speech conversion, comprising:
a memory having computer program code stored thereon; and
a processor configured to execute the computer program code to perform the method of any of claims 1 to 7.
9. A computer-readable storage medium having stored thereon computer program code which, when executed, performs the method of any of claims 1 to 7.
10. A chip circuit comprising circuit elements configured to perform the method of any one of claims 1 to 7 at power up.
CN202010498289.XA (filed 2020-06-04, priority date 2020-06-04) Method, apparatus, chip circuit and medium for text-to-speech conversion. Status: Active. Granted as CN111667815B.

Priority Applications (1)

Application Number: CN202010498289.XA; Priority Date: 2020-06-04; Filing Date: 2020-06-04; Title: Method, apparatus, chip circuit and medium for text-to-speech conversion (granted as CN111667815B)

Publications (2)

Publication Number Publication Date
CN111667815A (en) 2020-09-15
CN111667815B CN111667815B (en) 2023-09-01

Family

ID=72385997

Family Applications (1)

CN202010498289.XA (Active, granted as CN111667815B): Method, apparatus, chip circuit and medium for text-to-speech conversion; Priority Date and Filing Date: 2020-06-04

Country Status (1)

CN: CN111667815B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1788305A (en) * 2003-06-19 2006-06-14 国际商业机器公司 System and method for configuring voice readers using semantic analysis
US20050267756A1 (en) * 2004-05-26 2005-12-01 Schultz Paul T Method and system for providing synthesized speech
CN101354840A (en) * 2008-09-08 2009-01-28 众智瑞德科技(北京)有限公司 Method and apparatus for performing voice reading control of electronic book
US20100274838A1 (en) * 2009-04-24 2010-10-28 Zemer Richard A Systems and methods for pre-rendering an audio representation of textual content for subsequent playback
US20130144624A1 (en) * 2011-12-01 2013-06-06 At&T Intellectual Property I, L.P. System and method for low-latency web-based text-to-speech without plugins
US20140324436A1 (en) * 2013-04-27 2014-10-30 Tencent Technology (Shenzhen) Company Limited Method and apparatus for audio playing
US9646601B1 (en) * 2013-07-26 2017-05-09 Amazon Technologies, Inc. Reduced latency text-to-speech system
CN107480159A (en) * 2016-12-02 2017-12-15 广东小天才科技有限公司 The input method and device of a kind of speech data
CN109658917A (en) * 2019-01-17 2019-04-19 深圳壹账通智能科技有限公司 E-book chants method, apparatus, computer equipment and storage medium
CN110503991A (en) * 2019-08-07 2019-11-26 Oppo广东移动通信有限公司 Voice broadcast method, device, electronic equipment and storage medium
CN110853616A (en) * 2019-10-22 2020-02-28 武汉水象电子科技有限公司 Speech synthesis method, system and storage medium based on neural network

Also Published As

Publication number Publication date
CN111667815B (en) 2023-09-01

Similar Documents

Publication Publication Date Title
CN111667816B (en) Model training method, speech synthesis method, device, equipment and storage medium
CN112115706B (en) Text processing method and device, electronic equipment and medium
US8862478B2 (en) Speech translation system, first terminal apparatus, speech recognition server, translation server, and speech synthesis server
US10811005B2 (en) Adapting voice input processing based on voice input characteristics
EP3405912A1 (en) Analyzing textual data
KR20090111825A (en) Method and apparatus for language independent voice indexing and searching
WO2018045646A1 (en) Artificial intelligence-based method and device for human-machine interaction
US11586689B2 (en) Electronic apparatus and controlling method thereof
CN111489735B (en) Voice recognition model training method and device
WO2017166631A1 (en) Voice signal processing method, apparatus and electronic device
TW201606750A (en) Speech recognition using a foreign word grammar
CN111369971A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN104573099A (en) Topic searching method and device
US20150169676A1 (en) Generating a Table of Contents for Unformatted Text
KR102140391B1 (en) Search method and electronic device using the method
KR20210060897A (en) Method and apparatus for processing speech
KR20150077580A (en) Method and apparatus for providing of service based speech recognition
CN108986820B (en) Method, device, electronic equipment and storage medium for speech translation
CN111178076A (en) Named entity identification and linking method, device, equipment and readable storage medium
CN112133285B (en) Speech recognition method, device, storage medium and electronic equipment
CN113053362A (en) Method, device, equipment and computer readable medium for speech recognition
CN113012683A (en) Speech recognition method and device, equipment and computer readable storage medium
CN111783433A (en) Text retrieval error correction method and device
CN111667815B (en) Method, apparatus, chip circuit and medium for text-to-speech conversion
CN112836476B (en) Summary generation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant