CN117524191A - Method, apparatus, device and computer readable medium for speech synthesis - Google Patents


Info

Publication number: CN117524191A
Authority: CN (China)
Prior art keywords: phrases, prosodic, synthesized, audio, text
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202311465649.6A
Other languages: Chinese (zh)
Inventor: 马博森
Current and original assignee: Jingdong Technology Information Technology Co Ltd
Application filed by Jingdong Technology Information Technology Co Ltd
Priority to CN202311465649.6A
Publication of CN117524191A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method, an apparatus, a device, and a computer readable medium for speech synthesis, relating to the technical field of computers. One embodiment of the method comprises the following steps: locating prosodic phrases in the synthesized text, and matching them against historical phrase synthesized audio in a speech database according to the prosodic phrases, to obtain the prosodic phrases that fail to match and the prosodic phrases that match successfully; inputting the prosodic phrases that failed to match into an acoustic model, the acoustic model outputting newly created phrase synthesized audio; and constructing and outputting the speech of the synthesized text according to the arrangement order of the prosodic phrases in the synthesized text, the newly created phrase synthesized audio, and the historical phrase synthesized audio corresponding to the successfully matched prosodic phrases. This implementation can increase the speed of speech synthesis and ensure smooth real-time interaction with the user.

Description

Method, apparatus, device and computer readable medium for speech synthesis
Technical Field
The present invention relates to the field of computer technology, and in particular, to a method, apparatus, device, and computer readable medium for speech synthesis.
Background
With the rapid development of artificial intelligence, speech synthesis technology is increasingly widely applied. Synthesized speech is used in scenarios such as novel narration, digital humans, outbound marketing calls, and intelligent customer-service robots, for robot broadcasting or real-time interaction with customers.
In the process of implementing the present invention, the inventor found at least the following problem in the prior art: real scenarios place high real-time requirements on speech synthesis, and at present the speed of speech synthesis is low, making smooth real-time interaction with the user difficult.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, apparatus, device, and computer readable medium for speech synthesis, which can increase the speed of speech synthesis and ensure smooth real-time interaction with the user.
To achieve the above object, according to one aspect of the embodiments of the present invention, there is provided a method of speech synthesis, including:
locating prosodic phrases in the synthesized text, and matching them against historical phrase synthesized audio in a speech database according to the prosodic phrases, to obtain prosodic phrases that fail to match and prosodic phrases that match successfully;
inputting the prosodic phrases that failed to match into an acoustic model, the acoustic model outputting newly created phrase synthesized audio;
and constructing and outputting the speech of the synthesized text according to the arrangement order of the prosodic phrases in the synthesized text, the newly created phrase synthesized audio, and the historical phrase synthesized audio corresponding to the successfully matched prosodic phrases.
Before the locating of prosodic phrases in the synthesized text to match historical phrase synthesized audio in the speech database according to the prosodic phrases, the method further includes:
if matching historical sentence synthesized audio in the speech database based on the synthesized text fails, executing the matching of historical phrase synthesized audio according to the prosodic phrases of the synthesized text.
The locating of prosodic phrases in the synthesized text to match historical phrase synthesized audio in the speech database according to the prosodic phrases includes:
locating the prosodic phrases by performing text normalization, word segmentation, and prosody prediction on the synthesized text;
and matching historical phrase synthesized audio in the speech database using the prosodic phrases as keywords.
The inputting of the prosodic phrases that failed to match into an acoustic model, the acoustic model outputting newly created phrase synthesized audio, includes:
sending the prosodic phrases that failed to match to a back-end server;
invoking, in the back-end server, the acoustic model to process the prosodic phrases that failed to match;
and receiving the newly created phrase synthesized audio output by the acoustic model.
The method further includes:
updating the speech database with the newly created phrase synthesized audio and/or the speech of the synthesized text.
The synthesized text includes electronic book text or merchandise introduction text.
The speech database is used to store phrase synthesized audio of the synthesized text and sentence synthesized audio of the synthesized text, wherein the synthesized text includes a user identification.
According to a second aspect of an embodiment of the present invention, there is provided an apparatus for speech synthesis, including:
a matching module, configured to locate prosodic phrases in the synthesized text and match them against historical phrase synthesized audio in the speech database according to the prosodic phrases, obtaining the prosodic phrases that fail to match and the prosodic phrases that match successfully;
a creation module, configured to input the prosodic phrases that failed to match into an acoustic model, the acoustic model outputting newly created phrase synthesized audio;
and an output module, configured to construct and output the speech of the synthesized text according to the arrangement order of the prosodic phrases in the synthesized text, the newly created phrase synthesized audio, and the historical phrase synthesized audio corresponding to the successfully matched prosodic phrases.
According to a third aspect of an embodiment of the present invention, there is provided an electronic device for speech synthesis, including:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods as described above.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable medium having stored thereon a computer program which when executed by a processor implements a method as described above.
One embodiment of the above invention has the following advantages or benefits: prosodic phrases are located in the synthesized text and matched against historical phrase synthesized audio in the speech database, yielding the prosodic phrases that fail to match and those that match successfully; the prosodic phrases that failed to match are input into an acoustic model, which outputs newly created phrase synthesized audio; and the speech of the synthesized text is constructed and output according to the arrangement order of the prosodic phrases in the synthesized text, the newly created phrase synthesized audio, and the historical phrase synthesized audio corresponding to the successfully matched prosodic phrases. Using prosodic phrases as the basis for matching audio can increase the speed of speech synthesis and ensure smooth real-time interaction with the user.
Further effects of the above-described non-conventional alternatives are described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic flow diagram of a method of speech synthesis according to an embodiment of the invention;
FIG. 2 is a flow diagram of locating prosodic phrases according to an embodiment of the invention;
FIG. 3 is a flow diagram of invoking an acoustic model to output audio in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of a first speech synthesis according to an embodiment of the invention;
FIG. 5 is a schematic diagram of a re-speech synthesis according to an embodiment of the invention;
fig. 6 is a main structural diagram of an apparatus for speech synthesis according to an embodiment of the present invention;
FIG. 7 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;
fig. 8 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Smooth real-time interaction with users, and synthesizing more speech in less time, are goals that the industry constantly pursues and seeks to break through.
At present, the following mainstream technologies mainly realize the reasoning acceleration of speech synthesis:
mode one: hardware architecture acceleration using graphics processor (Graphics Processing Unit, GPU) and tensor processor (Tensor Processing Unit, TPU) is used. The use of hardware architectures such as GPU and TPU not only accelerates the cost, but also requires some adaptation and development work.
Mode two: the calculation amount of model reasoning is reduced by means of model quantization, distillation and the like, so that the real-time rate is improved. The above approach is also often detrimental to the speech synthesis effect, with quality being traded for performance.
Mode three: the synthesized audio is cached, and the same text and request parameters can be directly obtained from the cache and returned to the audio. Caching techniques are generally directed to request text for fixed speech and fixed parameters, and cannot hit caches with variables and changing synthesis parameters. Wherein the synthesis parameters include speech rate and speaker, etc.
In order to solve the technical problem that the low speed of speech synthesis makes smooth real-time interaction with the user difficult, the following technical solution in the embodiments of the present invention may be adopted.
Referring to fig. 1, fig. 1 is a schematic flow chart of a method of speech synthesis according to an embodiment of the present invention; prosodic phrases in the synthesized text are used as the basis for matching, improving the matching success rate and speed. As shown in fig. 1, the method specifically comprises the following steps:
S101, locating prosodic phrases in the synthesized text, and matching them against historical phrase synthesized audio in a speech database according to the prosodic phrases, to obtain the prosodic phrases that fail to match and the prosodic phrases that match successfully.
In embodiments of the present invention, the synthesized text needs to be converted into speech; the synthesized text is the text to be converted. As an example, the synthesized text is text that needs to be converted into speech in application scenarios such as novel narration, digital humans, outbound marketing calls, and intelligent customer-service robots.
In one embodiment of the invention, the synthesized text is text received from an application. As an example, a user terminal installs an application (APP); the user operates in the APP to obtain the synthesized text, which then needs to be converted into speech.
Specifically, prosodic phrases need to be located in the synthesized text. A prosodic word is the smallest prosodic unit that can be used freely. A prosodic phrase is an intermediate rhythmic unit consisting of one or more prosodic words, delimited by a perceivable pause and a converging pitch range. An intonation phrase is a stretch of speech with a complete intonation contour that can audibly stand as a sentence on its own.
Referring to fig. 2, fig. 2 is a flow chart illustrating locating prosodic phrases according to an embodiment of the invention. The method specifically comprises the following steps:
s201, positioning prosodic phrases through text normalization, word segmentation and prosody prediction on the synthesized text.
The synthesized text belongs to natural language, and in order to facilitate computer processing of the synthesized text, prosodic phrases need to be positioned through text normalization, word segmentation and prosodic prediction.
Text normalization is mainly to synthesize text by regular expression matching and then to perform rule substitution. As one example, the synthesized text includes: "number 12", hit through "(\d+) number" regular matching, and call the related method to normalize "number 12" text into Chinese character "twelve".
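The regex-match-then-substitute step can be sketched as follows. This is a minimal illustration, not the patent's implementation; the `normalize_text` and `to_chinese` helpers, and the 1 to 99 range they handle, are assumptions for the example.

```python
import re

def normalize_text(text):
    """Replace Arabic numerals followed by 号 with Chinese numeral words.

    A minimal sketch of the rule-substitution step; a production text
    normalizer applies a large battery of such rules (dates, currency,
    phone numbers, and so on).
    """
    digits = "零一二三四五六七八九"

    def to_chinese(num):
        # Simplified conversion, valid for 1-99 only (assumption).
        n = int(num)
        if n < 10:
            return digits[n]
        tens, ones = divmod(n, 10)
        prefix = digits[tens] if tens > 1 else ""
        return prefix + "十" + (digits[ones] if ones else "")

    # The callable replacement mirrors "hit by '(\d+)号', then substitute".
    return re.sub(r"(\d+)号", lambda m: to_chinese(m.group(1)) + "号", text)
```

For instance, `normalize_text("12号")` yields "十二号", matching the example in the text.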
Word segmentation is the segmentation of continuous text into individual tokens. The word segmentation may be implemented using natural language processing tools and/or language models.
Prosody prediction is the conversion of words into a symbolic description. As one example, the symbolic description includes: prosodic hierarchy level, boundary position, stress level and position, intonation type, and the like. Prosody prediction may be implemented using a prosody prediction model.
Prosody prediction is exemplified below.
The prosodic hierarchy is: Prosodic Word (PW), marked #1; Prosodic Phrase (PPH), marked #2; Intonation Phrase (IPH), marked #3.
The synthesized text after word segmentation is: 请问你是张三本人吗？("Excuse me, are you Zhang San himself?")
With prosody annotation: 请问#2你是#2张三#2本人#1吗？#3
Here the prosodic words include: 本人; the prosodic phrases include: 请问, 你是, 张三; and the intonation phrase is the whole sentence 请问你是张三本人吗？
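The annotated string above can be split into phrase-level units at the #2/#3 boundaries. A sketch of that parsing step follows; the function name and return convention are illustrative, since the patent does not specify an implementation.

```python
import re

def split_prosodic_units(annotated):
    """Split a prosody-annotated string into phrase-level units.

    Markers follow the convention in the text: #1 = prosodic word (PW),
    #2 = prosodic phrase (PPH), #3 = intonation phrase (IPH). Spans
    ending at a #2 or #3 boundary are returned; #1 boundaries are kept
    inside the enclosing phrase.
    """
    phrases = []
    buf = ""
    # re.split with a capturing group keeps the markers in the output.
    for token in re.split(r"(#[123])", annotated):
        if token in ("#2", "#3"):
            if buf:
                phrases.append(buf)
            buf = ""
        elif token == "#1":
            continue  # word boundary: keep accumulating into the phrase
        else:
            buf += token
    if buf:
        phrases.append(buf)
    return phrases
```

Applied to the example annotation, this yields 请问, 你是, 张三, and the final span 本人吗？.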
In embodiments of the present invention, after the prosodic phrases and intonation phrases are determined, the prosodic phrases can be located by position. As one example, the first prosodic phrase (请问) occupies the first and second characters of the first intonation phrase, and the second prosodic phrase (你是) occupies the third and fourth characters. The first intonation phrase includes all characters preceding the first sentence-final punctuation mark.
It should be noted that a prosodic word corresponds to a smaller amount of audio data than prosodic phrases and intonation phrases. In the embodiment of the present invention, in order to increase the speed of speech synthesis, the audio of prosodic phrases and of intonation phrases is used as the historical phrase synthesized audio.
S202, matching historical phrase synthesized audio in the speech database using the prosodic phrases as keywords.
Historical phrase synthesized audio is stored in the speech database; it is the phrase synthesized audio corresponding to prosodic phrases already encountered during speech synthesis. It will be appreciated that as more speech is synthesized, the amount of historical phrase synthesized audio grows.
In one embodiment of the invention, the speech database is used to store phrase synthesized audio of the synthesized text and sentence synthesized audio of the synthesized text, where the synthesized text includes a user identification. That is, the speech database is associated with the user. A given user mostly submits the same kind of synthesized text, and similar synthesized texts share many repeated words, so using the speech database increases the speed of speech synthesis.
In the embodiment of fig. 2, matching by prosodic phrase in the speech database increases the speed of speech synthesis. Specifically, historical phrase synthesized audio is matched in the speech database according to the prosodic phrases, yielding the prosodic phrases that fail to match and the prosodic phrases that match successfully.
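The keyword match of S202 amounts to a lookup keyed by phrase text. A minimal sketch, assuming the speech database is modeled as an in-memory mapping from phrase to audio bytes (a real system would also key on the user identification and synthesis parameters):

```python
def match_phrases(prosodic_phrases, speech_db):
    """Partition prosodic phrases into cache hits and misses.

    speech_db maps phrase text to previously synthesized audio bytes.
    Returns (matched, failed): matched maps each hit phrase to its
    historical audio; failed lists the phrases with no cached audio.
    """
    matched, failed = {}, []
    for phrase in prosodic_phrases:
        if phrase in speech_db:
            matched[phrase] = speech_db[phrase]
        else:
            failed.append(phrase)
    return matched, failed
```

The `failed` list is exactly what gets sent on to the acoustic model in step S102.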
S102, inputting the prosodic phrases that failed to match into an acoustic model, the acoustic model outputting newly created phrase synthesized audio.
The speech database does not contain all prosodic phrases and all intonation phrases. In the event of a match failure in the speech database, the prosodic phrases that failed to match may be input into the acoustic model in the back-end server.
Referring to fig. 3, fig. 3 is a schematic flow chart of invoking an acoustic model to output audio according to an embodiment of the present invention. The method specifically comprises the following steps:
s301, sending prosodic phrases with failed matching to a back-end server.
In an embodiment of the invention, matching prosodic phrases in the speech database is performed at the front end. In case of failure of matching, the prosodic phrase that failed to match needs to be sent to the back-end server. The back-end server is more computationally powerful than the front-end.
S302, in the back-end server, invoking an acoustic model to process prosodic phrases with failed matching.
At the back-end server, an acoustic model may be invoked to process the prosodic phrases that failed to match. As one example, the acoustic model includes a grapheme-to-phoneme (G2P) model, which can convert Chinese or English words into phonemes. A phoneme is the smallest pronunciation unit. For example, "张三" ("Zhang San") is converted into "zh ang1 s an1".
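The G2P step can be illustrated with a toy lexicon lookup. Real G2P front ends use trained models together with large pronunciation lexicons; the two-entry `LEXICON` below holds only the example from the text and is an assumption for the sketch.

```python
# A toy grapheme-to-phoneme table; the entries are just the example
# characters from the text ("张三" -> "zh ang1 s an1").
LEXICON = {
    "张": "zh ang1",
    "三": "s an1",
}

def g2p(text):
    """Convert each known character to pinyin-style phonemes.

    Characters missing from the toy lexicon are skipped; a real system
    would fall back to a trained G2P model instead.
    """
    return " ".join(LEXICON[ch] for ch in text if ch in LEXICON)
```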
S303, receiving the newly created phrase synthesized audio output by the acoustic model.
The back-end server invokes the acoustic model and obtains its output: after the prosodic phrases that failed to match are input into the acoustic model, the acoustic model outputs the newly created phrase synthesized audio, which is sent to the front end to satisfy the speech synthesis request.
In the embodiment of fig. 3, the back-end server invokes an acoustic model to convert the prosodic phrases that failed to match into audio.
S103, constructing and outputting the speech of the synthesized text according to the arrangement order of the prosodic phrases in the synthesized text, the newly created phrase synthesized audio, and the historical phrase synthesized audio corresponding to the successfully matched prosodic phrases.
The phrase synthesized audio corresponding to the prosodic phrases in the synthesized text involves both the newly created phrase synthesized audio and the historical phrase synthesized audio of the successfully matched prosodic phrases. Since the arrangement order of the prosodic phrases in the synthesized text is fixed, the speech of the synthesized text can be constructed in that order from the newly created phrase synthesized audio and the historical phrase synthesized audio, and then output.
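Step S103 then reduces to ordered concatenation of per-phrase clips. A sketch under the assumption that all clips are raw PCM bytes at one shared sample rate, so plain byte concatenation is valid; with encoded formats a proper audio library would be needed instead.

```python
def build_speech(ordered_phrases, matched_audio, new_audio):
    """Concatenate per-phrase audio following the phrase order of the text.

    matched_audio holds historical phrase synthesized audio (cache hits);
    new_audio holds the newly created phrase synthesized audio from the
    acoustic model. Both map phrase text to audio bytes.
    """
    pieces = []
    for phrase in ordered_phrases:
        if phrase in matched_audio:
            pieces.append(matched_audio[phrase])
        else:
            pieces.append(new_audio[phrase])
    return b"".join(pieces)
```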
In one embodiment of the invention, the speech database is updated with the newly created phrase synthesized audio and/or the speech of the synthesized text. The more audio the speech database holds, the faster speech synthesis becomes; updating the database in this way therefore serves future synthesis requests.
In one embodiment of the invention, to improve the efficiency of speech synthesis, before the prosodic phrases are located in the synthesized text and matched against historical phrase synthesized audio in the speech database, the synthesized text as a whole is first matched against historical sentence synthesized audio in the speech database. If matching historical sentence synthesized audio based on the synthesized text fails, the matching of historical phrase synthesized audio according to the prosodic phrases of the synthesized text is performed.
It will be appreciated that historical sentence synthesized audio is the existing audio corresponding to a whole sentence. If the synthesized text successfully matches historical sentence synthesized audio in the speech database, prosodic phrase matching is unnecessary, which increases the speed of speech synthesis. If the sentence-level match fails, the prosodic phrases are matched instead. As one example, the synthesized text includes intonation phrases.
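The two-level lookup just described (sentence cache first, then phrase-level matching with acoustic-model fallback and cache update) can be sketched as follows; all names here are illustrative, and `fallback_tts` stands in for the back-end acoustic-model call.

```python
def synthesize(text, sentence_db, phrase_db, locate, fallback_tts):
    """Sentence-level cache first, then phrase-level matching.

    sentence_db and phrase_db map text to audio bytes; `locate` splits
    the text into prosodic phrases. Both caches are updated so that
    later requests hit earlier and earlier levels.
    """
    audio = sentence_db.get(text)
    if audio is not None:           # whole-sentence cache hit
        return audio
    pieces = []
    for phrase in locate(text):
        clip = phrase_db.get(phrase)
        if clip is None:            # cache miss: synthesize and store
            clip = fallback_tts(phrase)
            phrase_db[phrase] = clip
        pieces.append(clip)
    audio = b"".join(pieces)
    sentence_db[text] = audio       # update the sentence-level cache too
    return audio
```

A second request with the same text then returns directly from the sentence cache without touching the acoustic model at all.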
In the embodiment of the invention, prosodic phrases are located in the synthesized text and matched against historical phrase synthesized audio in the speech database, yielding the prosodic phrases that fail to match and those that match successfully; the prosodic phrases that failed to match are input into an acoustic model, which outputs newly created phrase synthesized audio; and the speech of the synthesized text is constructed and output according to the arrangement order of the prosodic phrases in the synthesized text, the newly created phrase synthesized audio, and the historical phrase synthesized audio corresponding to the successfully matched prosodic phrases. Using prosodic phrases as the basis for matching audio can increase the speed of speech synthesis and ensure smooth real-time interaction with the user.
Referring to fig. 4, fig. 4 is a schematic diagram of a first speech synthesis according to an embodiment of the present invention. The synthesized text includes: 请问你是张三本人吗？("Excuse me, are you Zhang San himself?") If matching the synthesized text as a whole in the speech database fails, the prosodic phrases are located. The prosodic phrases include: 请问, 你是, 张三, and 本人.
Because this is the first speech synthesis, the speech database holds little audio, and the prosodic phrases all fail to match. The newly created phrase synthesized audio must be output by the acoustic model in the back-end server. The speech of the synthesized text is constructed and output from the arrangement order of the prosodic phrases and the newly created phrase synthesized audio, and the newly created phrase synthesized audio is then stored in the speech database.
Here, 本人 is a prosodic word, and the amount of audio data corresponding to a prosodic word is smaller than that of prosodic phrases and intonation phrases. Storing the audio of the prosodic phrases and the intonation phrase in the speech database as newly created phrase synthesized audio means that, when that audio is called again as historical phrase synthesized audio, the speed of speech synthesis can be improved.
Referring to fig. 5, fig. 5 is a schematic diagram of a subsequent speech synthesis according to an embodiment of the present invention. The synthesized text includes: 请问你是李四本人吗？("Excuse me, are you Li Si himself?") If matching the synthesized text as a whole in the speech database fails, the prosodic phrases are located. The prosodic phrases include: 请问, 你是, 李四, and 本人.
Because speech has been synthesized before, the speech database already stores the historical phrase synthesized audio for 请问, 你是, and 本人, and these prosodic phrases match successfully in the speech database. Only 李四 requires newly created phrase synthesized audio output by the acoustic model in the back-end server. The speech of the synthesized text is constructed and output according to the arrangement order of the prosodic phrases, the historical phrase synthesized audio, and the newly created phrase synthesized audio; the newly created phrase synthesized audio is then stored in the speech database.
It can be seen that, with the technical solution in the embodiment of the present invention, only part of the prosodic phrases in the synthesized text changes (张三 is replaced by 李四); only the audio for the prosodic phrase 李四 is newly synthesized, the phrase synthesized audio of the other prosodic phrases is obtained directly from the speech database, and finally the speech of the synthesized text is constructed. This greatly reduces the workload of invoking the acoustic model on the back-end server, thereby increasing the speed of speech synthesis and ensuring smooth real-time interaction with the user.
Referring to fig. 6, fig. 6 is a schematic diagram of the main structure of a speech synthesis apparatus according to an embodiment of the present invention; the apparatus can implement the method of speech synthesis. As shown in fig. 6, the apparatus specifically includes:
a matching module 601, configured to locate prosodic phrases in the synthesized text and match them against historical phrase synthesized audio in the speech database according to the prosodic phrases, obtaining the prosodic phrases that fail to match and the prosodic phrases that match successfully;
a creation module 602, configured to input the prosodic phrases that failed to match into an acoustic model, the acoustic model outputting newly created phrase synthesized audio;
and an output module 603, configured to construct and output the speech of the synthesized text according to the arrangement order of the prosodic phrases in the synthesized text, the newly created phrase synthesized audio, and the historical phrase synthesized audio corresponding to the successfully matched prosodic phrases.
The matching module 601 is further configured to: if matching historical sentence synthesized audio in the speech database based on the synthesized text fails, execute the matching of historical phrase synthesized audio according to the prosodic phrases of the synthesized text.
In one embodiment of the present invention, the matching module 601 is specifically configured to locate the prosodic phrases by performing text normalization, word segmentation, and prosody prediction on the synthesized text;
and to match historical phrase synthesized audio in the speech database using the prosodic phrases as keywords.
In one embodiment of the present invention, the creation module 602 is specifically configured to send the prosodic phrases that failed to match to the back-end server;
invoke, in the back-end server, the acoustic model to process the prosodic phrases that failed to match;
and receive the newly created phrase synthesized audio output by the acoustic model.
In one embodiment of the present invention, the output module 603 is further configured to update the speech database with the newly created phrase synthesized audio and/or the speech of the synthesized text.
In one embodiment of the present invention, the synthesized text includes electronic book text or merchandise introduction text. In one embodiment of the present invention, the speech database is configured to store phrase synthesized audio of the synthesized text and sentence synthesized audio of the synthesized text, the synthesized text including a user identification.
Fig. 7 illustrates an exemplary system architecture 700 of a speech synthesis method or apparatus to which embodiments of the invention may be applied.
As shown in fig. 7, a system architecture 700 may include terminal devices 701, 702, 703, a network 704, and a server 705. The network 704 is the medium used to provide communication links between the terminal devices 701, 702, 703 and the server 705. The network 704 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 705 via the network 704 using the terminal devices 701, 702, 703 to receive or send messages or the like. Various communication client applications such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 701, 702, 703.
The terminal devices 701, 702, 703 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 705 may be a server providing various services, for example a background management server (by way of example only) providing support for shopping websites browsed by users with the terminal devices 701, 702, 703. The background management server may analyze and process received data such as a product information query request, and feed back the processing result (e.g., target push information or product information, by way of example only) to the terminal device.
It should be noted that the method for speech synthesis provided in the embodiments of the present invention is generally executed by the server 705; accordingly, the apparatus for speech synthesis is generally disposed in the server 705.
It should be understood that the number of terminal devices, networks and servers in fig. 7 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 8, there is illustrated a schematic diagram of a computer system 800 suitable for use in implementing an embodiment of the present invention. The terminal device shown in fig. 8 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU) 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output portion 807 including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD) and a speaker; a storage section 808 including a hard disk or the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read therefrom is installed into the storage section 808 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 809 and/or installed from the removable medium 811. When the computer program is executed by the central processing unit (CPU) 801, the above-described functions defined in the system of the present invention are performed.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, for example described as: a processor including a matching module, a new-building module, and an output module. The names of these modules do not, in some cases, constitute a limitation on the modules themselves; for example, the matching module may also be described as "a module for locating prosodic phrases in the synthesized text so as to match historical phrases in the speech database against the prosodic phrases to synthesize audio, obtaining prosodic phrases that failed to match and prosodic phrases that matched successfully".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist alone without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to:
locate prosodic phrases in the synthesized text, so as to match historical phrases in a speech database against the prosodic phrases to synthesize audio, obtaining prosodic phrases that failed to match and prosodic phrases that matched successfully;
input the prosodic phrases that failed to match into an acoustic model, the acoustic model outputting newly built phrase synthesized audio; and
construct and output the speech of the synthesized text according to the arrangement order of the prosodic phrases in the synthesized text, the newly built phrase synthesized audio, and the historical phrase synthesized audio corresponding to the prosodic phrases that matched successfully.
According to the technical solution of the embodiments of the present invention, prosodic phrases are located in the synthesized text, so that historical phrases in a speech database are matched against the prosodic phrases to synthesize audio, yielding prosodic phrases that failed to match and prosodic phrases that matched successfully; the prosodic phrases that failed to match are input into an acoustic model, which outputs newly built phrase synthesized audio; and the speech of the synthesized text is constructed and output according to the arrangement order of the prosodic phrases in the synthesized text, the newly built phrase synthesized audio, and the historical phrase synthesized audio corresponding to the prosodic phrases that matched successfully. Using prosodic phrases as the basis for matching audio improves the speed of speech synthesis and ensures real-time, smooth interaction with the user.
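The three-step flow above (phrase-level matching against a speech database, acoustic-model fallback for misses, and in-order assembly) can be illustrated with a minimal Python sketch. This is an editor's illustration only, not code from the patent: `locate_prosodic_phrases` is a crude punctuation-based stand-in for the real text normalization, word segmentation, and prosody prediction, and the toy "acoustic model" just returns placeholder bytes.

```python
import re

def locate_prosodic_phrases(text):
    # Crude stand-in for text normalization, word segmentation, and
    # prosody prediction: split the text at punctuation boundaries.
    return [p.strip() for p in re.split(r"[,.!?;]", text) if p.strip()]

def synthesize(text, speech_db, acoustic_model):
    """Assemble speech for `text` from cached phrase audio, synthesizing
    only the prosodic phrases that fail to match the speech database."""
    audio_parts = []
    for phrase in locate_prosodic_phrases(text):
        cached = speech_db.get(phrase)       # match against historical phrases
        if cached is None:                   # match failed: fall back to the model
            cached = acoustic_model(phrase)  # newly built phrase synthesized audio
            speech_db[phrase] = cached       # update the speech database
        audio_parts.append(cached)
    # Concatenate in the phrases' arrangement order to form the output speech.
    return b"".join(audio_parts)

# Toy "acoustic model": returns fake audio bytes derived from the phrase.
toy_model = lambda phrase: phrase.encode("utf-8")

db = {"hello world": b"HW"}  # pre-existing historical phrase audio
out = synthesize("hello world, nice to meet you", db, toy_model)
```

Because only the unmatched phrase is sent to the model, repeated phrases are served from the database on later requests, which is the speed advantage the paragraph above claims.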
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations, and substitutions can occur depending on design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention shall be included in the scope of the present invention. It should be noted that, in the technical solution of the present disclosure, the acquisition, storage, and application of the user's personal information involved all comply with the provisions of the relevant laws and regulations and do not violate public order and good customs.

Claims (10)

1. A method of speech synthesis, comprising:
locating prosodic phrases in synthesized text, so as to match historical phrases in a speech database against the prosodic phrases to synthesize audio, obtaining prosodic phrases that failed to match and prosodic phrases that matched successfully;
inputting the prosodic phrases that failed to match into an acoustic model, the acoustic model outputting newly built phrase synthesized audio; and
constructing and outputting speech of the synthesized text according to an arrangement order of the prosodic phrases in the synthesized text, the newly built phrase synthesized audio, and historical phrase synthesized audio corresponding to the prosodic phrases that matched successfully.
2. The method of speech synthesis according to claim 1, wherein before locating the prosodic phrases in the synthesized text so as to match the historical phrases in the speech database against the prosodic phrases to synthesize audio, the method further comprises:
if matching historical sentences against the synthesized text to synthesize audio fails in the speech database, executing the matching of the historical phrases against the prosodic phrases of the synthesized text to synthesize audio.
3. The method of speech synthesis according to claim 1, wherein locating the prosodic phrases in the synthesized text so as to match the historical phrases in the speech database against the prosodic phrases to synthesize audio comprises:
locating the prosodic phrases by performing text normalization, word segmentation, and prosody prediction on the synthesized text; and
matching the historical phrases in the speech database to synthesize audio, using the prosodic phrases as keywords.
4. The method of speech synthesis according to claim 1, wherein inputting the prosodic phrases that failed to match into the acoustic model, the acoustic model outputting the newly built phrase synthesized audio, comprises:
sending the prosodic phrases that failed to match to a back-end server;
invoking, in the back-end server, the acoustic model to process the prosodic phrases that failed to match; and
receiving the newly built phrase synthesized audio output by the acoustic model.
5. The method of speech synthesis according to claim 1, further comprising:
updating the speech database with the newly built phrase synthesized audio and/or the speech of the synthesized text.
6. The method of claim 1, wherein the synthesized text comprises electronic book text or merchandise introduction text.
7. The method of speech synthesis according to claim 1, wherein the speech database is used to store phrase synthesized audio of the synthesized text and sentence synthesized audio of the synthesized text, the synthesized text including a user identification.
8. An apparatus for speech synthesis, comprising:
the matching module, configured to locate prosodic phrases in synthesized text so as to match historical phrases in a speech database against the prosodic phrases to synthesize audio, obtaining prosodic phrases that failed to match and prosodic phrases that matched successfully;
the new-building module, configured to input the prosodic phrases that failed to match into an acoustic model, the acoustic model outputting newly built phrase synthesized audio; and
the output module, configured to construct and output speech of the synthesized text according to an arrangement order of the prosodic phrases in the synthesized text, the newly built phrase synthesized audio, and historical phrase synthesized audio corresponding to the prosodic phrases that matched successfully.
9. An electronic device for speech synthesis, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-7.
10. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-7.
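Claim 2 adds a two-tier lookup: the whole synthesized text is first matched as a sentence, and the phrase-level matching of claim 1 runs only when that sentence match fails. The following sketch is an editor's illustration of one possible shape of that logic; none of the identifiers (`lookup_audio`, `sentence_db`, `phrase_db`, the `b"?"` miss marker) come from the patent.

```python
def lookup_audio(text, sentence_db, phrase_db, split_phrases):
    # Tier 1: try to match the whole synthesized text as a sentence.
    sentence_hit = sentence_db.get(text)
    if sentence_hit is not None:
        return sentence_hit                        # sentence-level match succeeded
    # Tier 2: sentence match failed, so match the text's prosodic phrases.
    parts = []
    for phrase in split_phrases(text):
        parts.append(phrase_db.get(phrase, b"?"))  # b"?" marks a failed phrase match
    return b"".join(parts)

# Toy phrase splitter and databases for illustration only.
splitter = lambda t: t.split(", ")
s_db = {"good morning": b"GM"}
p_db = {"good": b"G", "morning": b"M"}

hit = lookup_audio("good morning", s_db, p_db, splitter)    # tier-1 hit
miss = lookup_audio("good, evening", s_db, p_db, splitter)  # tier-2 fallback
```

In a full implementation, the phrases that come back as misses at tier 2 would be the "prosodic phrases that failed to match" handed to the acoustic model.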
CN202311465649.6A 2023-11-06 2023-11-06 Method, apparatus, device and computer readable medium for speech synthesis Pending CN117524191A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311465649.6A CN117524191A (en) 2023-11-06 2023-11-06 Method, apparatus, device and computer readable medium for speech synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311465649.6A CN117524191A (en) 2023-11-06 2023-11-06 Method, apparatus, device and computer readable medium for speech synthesis

Publications (1)

Publication Number Publication Date
CN117524191A true CN117524191A (en) 2024-02-06

Family

ID=89741091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311465649.6A Pending CN117524191A (en) 2023-11-06 2023-11-06 Method, apparatus, device and computer readable medium for speech synthesis

Country Status (1)

Country Link
CN (1) CN117524191A (en)

Similar Documents

Publication Publication Date Title
CN108630190B (en) Method and apparatus for generating speech synthesis model
CN112489620B (en) Speech synthesis method, device, readable medium and electronic equipment
CN112309366B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN110197655B (en) Method and apparatus for synthesizing speech
US10824664B2 (en) Method and apparatus for providing text push information responsive to a voice query request
CN112489621B (en) Speech synthesis method, device, readable medium and electronic equipment
CN112786007A (en) Speech synthesis method, device, readable medium and electronic equipment
CN110827805A (en) Speech recognition model training method, speech recognition method and device
CN112786008A (en) Speech synthesis method, device, readable medium and electronic equipment
CN112259089A (en) Voice recognition method and device
US20210082408A1 (en) Generating acoustic sequences via neural networks using combined prosody info
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN111696521A (en) Method for training speech clone model, readable storage medium and speech clone method
CN113327580A (en) Speech synthesis method, device, readable medium and electronic equipment
JP2022133408A (en) Speech conversion method and system, electronic apparatus, readable storage medium, and computer program
US11295732B2 (en) Dynamic interpolation for hybrid language models
CN114550702A (en) Voice recognition method and device
CN111862961A (en) Method and device for recognizing voice
CN111681661B (en) Speech recognition method, apparatus, electronic device and computer readable medium
WO2023179506A1 (en) Prosody prediction method and apparatus, and readable medium and electronic device
KR102611024B1 (en) Voice synthesis method and device, equipment and computer storage medium
CN117524191A (en) Method, apparatus, device and computer readable medium for speech synthesis
JP2022133447A (en) Speech processing method and device, electronic apparatus, and storage medium
CN114999440A (en) Avatar generation method, apparatus, device, storage medium, and program product
WO2019097335A1 (en) Phonetic patterns for fuzzy matching in natural language processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination