CN110600004A - Voice synthesis playing method and device and storage medium - Google Patents


Info

Publication number
CN110600004A
CN110600004A
Authority
CN
China
Prior art keywords
voice
synthesized
text
synthesis
pronunciation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910848598.2A
Other languages
Chinese (zh)
Inventor
杨木文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910848598.2A priority Critical patent/CN110600004A/en
Publication of CN110600004A publication Critical patent/CN110600004A/en
Pending legal-status Critical Current


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 — Speech synthesis; Text to speech systems
    • G10L 13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/086 — Detection of language

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the invention discloses a voice synthesis playing method, a device, and a storage medium. A user terminal receives a voice synthesis request and, according to the request, acquires the text to be synthesized. It then sends the text to a voice synthesis server for voice synthesis to obtain the corresponding synthesized voice and plays that voice. On receiving a pronunciation correction request for the synthesized voice, the terminal receives correction data corresponding to the synthesized voice according to the request and sends the correction data to the voice synthesis server, which updates the synthesized voice; the terminal then replaces the currently played synthesized voice with the updated synthesized voice for playing. Compared with the prior art, the method and the device can correct and update the synthesized voice in real time while it is being played, so that the pronunciation of a polyphone can be corrected in time even when its pronunciation prediction is wrong.

Description

Voice synthesis playing method and device and storage medium
Technical Field
The invention relates to the technical field of voice, in particular to a voice synthesis playing method, a voice synthesis playing device and a storage medium.
Background
Speech synthesis technology, also known as text-to-speech (TTS), aims to enable a machine to recognize and understand text information and convert it into speech output, so that the machine can speak; it is an important branch of future human-computer interaction.
Speech synthesis technology is widely applied, for example, to reading web page contents, novels, and emails aloud. When a novel is read aloud, a user terminal such as a mobile phone or a tablet computer can read out, through speech synthesis, the novel the user is reading, so that the user can "read" the novel with closed eyes.
In the process of research and practice of the prior art, the inventor of the invention found that the polyphone processing capability of existing speech synthesis technology is deficient: the pronunciation of a polyphone cannot be accurately predicted in uncommon contexts.
Disclosure of Invention
Embodiments of the present invention provide a method, an apparatus, and a storage medium for speech synthesis playing, which can correct pronunciation of a polyphone in time when the pronunciation prediction of the polyphone is incorrect.
The embodiment of the invention provides a voice synthesis playing method, which comprises the following steps:
receiving a voice synthesis request, and acquiring a text to be synthesized, which needs to be subjected to voice synthesis, according to the voice synthesis request;
sending the text to be synthesized to a voice synthesis server for voice synthesis, so that the voice synthesis server returns a synthesized voice corresponding to the text to be synthesized;
playing the synthesized voice and receiving a pronunciation correction request for the synthesized voice;
receiving input correction data corresponding to the synthesized voice according to the pronunciation correction request, and sending the correction data to the voice synthesis server, so that the voice synthesis server updates the synthesized voice according to the correction data and returns the updated synthesized voice;
and replacing the currently played synthesized voice with the updated synthesized voice for playing.
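The terminal-side steps above can be sketched as follows. This is an illustrative sketch only: the server interface (`synthesize`/`update`), the class names, and the string "waveforms" are hypothetical stand-ins, not the patented implementation.

```python
class SpeechSynthesisClient:
    """Illustrative user-terminal side of the correct-while-playing flow."""

    def __init__(self, server):
        self.server = server
        self.current_audio = None

    def request_synthesis(self, text):
        """Send the text to be synthesized; play (here: hold) the result."""
        self.current_audio = self.server.synthesize(text)
        return self.current_audio

    def correct_pronunciation(self, target_word, target_pronunciation):
        """Send correction data; replace the playing speech with the update."""
        updated = self.server.update(target_word, target_pronunciation)
        self.current_audio = updated  # replaces the currently played speech
        return self.current_audio


class FakeServer:
    """Stand-in for the speech synthesis server."""

    def __init__(self):
        self.text = None
        self.overrides = {}

    def synthesize(self, text):
        self.text = text
        return f"audio({text})"

    def update(self, word, pron):
        self.overrides[word] = pron
        return f"audio({self.text}, {word}={pron})"
```

A correction thus produces a new audio object that the client swaps in for the one currently playing, which is the core of the claimed method.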
The embodiment of the invention also provides a voice synthesis playing method, which comprises the following steps:
when a text to be synthesized from a user terminal is received, carrying out voice synthesis on the text to be synthesized according to a pre-trained voice synthesis model to obtain synthesized voice;
returning the synthesized voice to the user terminal for playing, and receiving correction data corresponding to the synthesized voice returned by the user terminal;
updating the synthesized voice according to the correction data to obtain updated synthesized voice;
and returning the updated synthesized voice to the user terminal, so that the user terminal replaces the synthesized voice with the updated synthesized voice for playing.
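The server-side steps above can likewise be sketched in miniature. The "model" below is a trivial lookup table standing in for the pre-trained speech synthesis model, and the pipe-joined string stands in for a waveform; all names are assumptions for illustration.

```python
def predict_pronunciations(text, overrides=None):
    """Predict one pronunciation per unit; correction data takes priority."""
    overrides = overrides or {}
    default = {"zhong": "zhong4"}  # assumed model prediction table
    return [overrides.get(u, default.get(u, u)) for u in text.split()]


def synthesize(text, correction_data=None):
    """Return a (fake) waveform built from the predicted pronunciations."""
    prons = predict_pronunciations(text, correction_data)
    return "|".join(prons)


# First synthesis, then an update driven by correction data from the terminal.
first = synthesize("zhong fu")
updated = synthesize("zhong fu", {"zhong": "chong2"})
```

Re-running synthesis with the correction data applied yields the updated synthesized voice that is returned to the user terminal.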
An embodiment of the present invention further provides a speech synthesis playing apparatus, including:
the text acquisition module is used for receiving a voice synthesis request and acquiring a text to be synthesized, which needs voice synthesis, according to the voice synthesis request;
the voice synthesis module is used for sending the text to be synthesized to a voice synthesis server for voice synthesis, so that the voice synthesis server returns the synthesized voice corresponding to the text to be synthesized;
the voice playing module is used for playing the synthesized voice and receiving a pronunciation correction request of the synthesized voice;
the text correction module is used for receiving input correction data corresponding to the synthesized voice according to the pronunciation correction request and sending the correction data to the voice synthesis server, so that the voice synthesis server updates the synthesized voice according to the correction data and returns the updated synthesized voice;
the voice playing module is further configured to replace the currently played synthesized voice with the updated synthesized voice for playing.
In one embodiment, the text correction module, in receiving correction data corresponding to the synthesized speech based on the pronunciation correction request, is configured to:
displaying a pronunciation correction interface according to the pronunciation correction request, wherein the pronunciation correction interface comprises a character input control and a pronunciation control;
receiving target words needing to be corrected in the text to be synthesized based on the word input control;
receiving a target pronunciation corresponding to the target character based on the pronunciation control;
setting the target word and the target pronunciation as the correction data.
In one embodiment, upon receiving a target pronunciation for a corresponding target word based on a pronunciation control, the text correction module is to:
checking whether the target character is a polyphone character;
when the target character is judged to be a polyphone character, acquiring a plurality of pronunciations corresponding to the target character according to a preset corresponding relation between the polyphone character and the pronunciations;
displaying the plurality of pronunciations based on the pronunciation control, and receiving selection operation of the displayed pronunciations;
and setting the pronunciation corresponding to the selection operation as the target pronunciation of the target character.
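The polyphone check and pronunciation-selection flow above can be sketched as follows. The pronunciation table is a small assumed sample of the "preset corresponding relation between the polyphone and the pronunciations"; the function names are hypothetical.

```python
# Assumed sample of the preset polyphone-to-pronunciations mapping.
POLYPHONE_TABLE = {
    "zhong": ["zhong4", "chong2"],  # e.g. the character 重
    "wei": ["wei4", "wei2"],        # e.g. the character 为
}


def candidate_pronunciations(target_char):
    """Return all pronunciations if the character is a polyphone, else None."""
    return POLYPHONE_TABLE.get(target_char)


def select_pronunciation(target_char, choice_index):
    """Simulate the user's selection operation on the pronunciation control."""
    candidates = candidate_pronunciations(target_char)
    if candidates is None:
        raise ValueError(f"{target_char!r} is not a known polyphone")
    return {
        "target_word": target_char,
        "target_pronunciation": candidates[choice_index],
    }
```

The returned dict corresponds to the correction data (target word plus target pronunciation) that the terminal sends to the server.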
In an embodiment, when acquiring a text to be synthesized, which needs to be subjected to speech synthesis, according to a speech synthesis request, the text acquisition module is configured to:
extracting a text in the display content of the foreground application according to the voice synthesis request to obtain an extracted text;
dividing the extracted text into a plurality of clauses according to a preset clause strategy;
and setting the clauses as the texts to be synthesized.
In an embodiment, in the process of playing the synthesized speech, the speech playing module is further configured to:
and marking the clauses corresponding to the synthesized voice according to a preset rule.
In one embodiment, upon receiving a pronunciation correction request for the synthesized speech, the speech playback module is configured to:
displaying a pronunciation correction control in a preset range of the clause corresponding to the synthesized voice;
an utterance correction request for the synthesized speech is received based on the utterance correction control.
In an embodiment, the speech synthesis playing apparatus provided in the embodiment of the present invention further includes a data storage module, configured to:
and storing the text to be synthesized, the synthesized voice and/or the updated synthesized voice into a distributed system.
The embodiment of the invention also provides a voice synthesis playing device, which comprises a voice synthesis module, a voice issuing module and a voice updating module, wherein,
the voice synthesis module is used for carrying out voice synthesis on the text to be synthesized according to a pre-trained voice synthesis model when receiving the text to be synthesized from the user terminal to obtain synthesized voice;
the voice issuing module is used for returning the synthesized voice to the user terminal for playing and receiving the correction data corresponding to the text to be synthesized returned by the user terminal;
the voice updating module is used for updating the synthesized voice according to the correction data to obtain the updated synthesized voice;
the voice issuing module is further configured to return the updated synthesized voice to the user terminal, so that the user terminal replaces the synthesized voice with the updated synthesized voice for playing.
In an embodiment, the speech synthesis playing apparatus provided in the embodiment of the present invention further includes a model updating module, configured to:
and updating the voice synthesis model according to the text to be synthesized and the correction data.
In an embodiment, the speech synthesis playing apparatus provided in the embodiment of the present invention further includes a data storage module, configured to:
and storing the text to be synthesized, the synthesized voice and/or the updated synthesized voice into a distributed system.
In addition, an embodiment of the present invention further provides a storage medium, where multiple instructions are stored in the storage medium, and the instructions are suitable for being loaded by a processor to execute any one of the speech synthesis playing methods provided in the embodiment of the present invention.
By the above scheme, the user terminal receives a voice synthesis request, acquires the text to be synthesized according to the request, and sends it to a voice synthesis server for voice synthesis to obtain the corresponding synthesized voice, which it then plays. On receiving a pronunciation correction request for the synthesized voice, the terminal receives the corresponding correction data according to the request and sends it to the voice synthesis server, which updates the synthesized voice; the terminal then replaces the currently played synthesized voice with the updated synthesized voice for playing. Compared with the prior art, the method and the device can correct and update the synthesized voice in real time while it is being played, so that the pronunciation of a polyphone can be corrected in time even when its pronunciation prediction is wrong.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an architecture of a speech synthesis playing system according to an embodiment of the present invention;
FIG. 2a is a flow chart of a speech synthesis playing method according to an embodiment of the present invention;
FIG. 2b is a diagram illustrating a speech synthesis control in an embodiment of the present invention;
FIG. 2c is a diagram illustrating a marked clause corresponding to the synthesized speech according to an embodiment of the present invention;
FIG. 2d is another diagram illustrating a marked clause corresponding to the synthesized speech in an embodiment of the present invention;
FIG. 2e is a diagram illustrating a pronunciation correction control according to an embodiment of the present invention;
FIG. 2f is a diagram illustrating a pronunciation correction interface in an embodiment of the present invention;
FIG. 2g is a schematic diagram of a distributed system according to an embodiment of the present invention;
FIG. 2h is a block diagram of an embodiment of the invention;
fig. 3 is another schematic flow chart of a speech synthesis playing method according to an embodiment of the present invention;
fig. 4 is another schematic flow chart of a speech synthesis playing method according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a speech synthesis playing apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a speech synthesis playing apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a user terminal in an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a speech synthesis server in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Key branches of speech technology include automatic speech recognition (ASR), speech synthesis (Text To Speech, TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech synthesis is expected to become one of the best human-computer interaction modes.
Early speech synthesis was generally implemented with dedicated chips, such as the TMS50C10/TMS50C57 from Texas Instruments or the Philips PH84H36, and such chips were mainly used in household appliances and children's toys.
Today's speech synthesis is generally implemented in pure software. The text-to-speech conversion process is as follows: first, the text is preprocessed, segmented into words, part-of-speech tagged, and subjected to polyphone prediction and prosody level prediction; then an acoustic model predicts the acoustic features corresponding to each unit; finally, a vocoder synthesizes sound directly from the acoustic parameters, or units are selected from a recorded speech library and spliced together, thereby generating the speech corresponding to the text.
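The pipeline just described can be expressed schematically. Every stage below is a placeholder standing in for a real component (text front end, acoustic model, vocoder); the function names and the numeric "waveform" are assumptions for illustration only.

```python
def front_end(text):
    """Preprocess: segment the text and predict per-unit pronunciations."""
    units = text.split()
    return [{"unit": u, "pronunciation": u, "prosody": "neutral"} for u in units]


def acoustic_model(features):
    """Predict acoustic parameters for each unit (stand-in values)."""
    return [{"unit": f["unit"], "params": len(f["pronunciation"])} for f in features]


def vocoder(acoustic_params):
    """Turn acoustic parameters directly into a (fake) waveform."""
    return [p["params"] for p in acoustic_params]


def text_to_speech(text):
    """Chain the three stages: front end -> acoustic model -> vocoder."""
    return vocoder(acoustic_model(front_end(text)))
```

In a real system, polyphone prediction happens in the front end, which is why correcting a pronunciation requires re-running synthesis from that stage.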
For Chinese speech synthesis, the key research directions at present are Chinese prosody processing, reading of symbols and digits, polyphone prediction, word formation, and so on; continuous research is needed to improve the naturalness of Chinese speech synthesis.
Polyphone prediction is one of the foundations of Chinese speech synthesis: whether polyphones are pronounced correctly greatly influences a listener's understanding of the synthesized speech. If polyphone prediction accuracy is high, the synthesized speech is easier to understand and sounds more natural and fluent, greatly improving the user experience.
At present, for polyphones, the following synthesis strategy is mostly adopted in the existing speech synthesis:
if the polyphone forms a word with its context, speech synthesis follows the fixed collocation; for example, 重 is read zhong4 in 重点 ("key point") and chong2 in 重复 ("repeat"), where the digit after the pinyin denotes the tone;
if the polyphone appears in the form of a single character, the pronunciation of the polyphone is predicted by using a speech synthesis model trained by using a large amount of sample data in advance, for example, (wei4) people service is provided, and the result is (wei2) zero.
Common training methods for the speech synthesis model include, but are not limited to, the Conditional Random Field (CRF) method, the Hidden Markov Model (HMM) method, and the decision tree method. These methods all require a large number of polyphone pronunciation samples for training. Their advantage is that the pronunciation of a polyphone can be predicted from text alone, with high accuracy in common contexts; their weakness is poor handling of polyphones in uncommon contexts.
Based on the above defects in the prior art, embodiments of the present invention provide a speech synthesis playing method, apparatus, and storage medium, which are suitable for speech synthesis playing both at the user terminal and at the speech synthesis server.
Referring to fig. 1, an embodiment of the present invention further provides a speech synthesis playing system, where the speech synthesis playing system includes a user terminal 10, a speech synthesis server 20, and a network 30 (which may be a wired network or a wireless network), and the user terminal 10 interacts with the speech synthesis server 20 through the network 30. The network 30 includes network entities such as routers and gateways, which are not shown in fig. 1.
Based on the speech synthesis playing system shown in fig. 1, the user terminal 10 may receive a speech synthesis request, obtain the text to be synthesized according to the request, and send the text to the speech synthesis server 20; after receiving the text to be synthesized from the user terminal 10, the speech synthesis server 20 performs speech synthesis on it to obtain the corresponding synthesized speech and returns the synthesized speech to the user terminal 10; the user terminal 10, after receiving the synthesized speech, plays it, receives a pronunciation correction request for the synthesized speech, then receives correction data corresponding to the synthesized speech according to the request, and sends the correction data to the speech synthesis server 20; the speech synthesis server 20, after receiving the correction data from the user terminal 10, updates the synthesized speech based on it and returns the updated synthesized speech to the user terminal 10; after receiving the updated synthesized speech, the user terminal 10 replaces the currently played synthesized speech with the updated synthesized speech and plays it.
It should be noted that fig. 1 illustrates only one example of a system architecture for implementing the embodiment of the present invention, and the embodiment of the present invention is not limited to the system architecture illustrated in fig. 1. Based on the system architecture, detailed descriptions are given below. The order of the following examples is not intended to limit the preferred order of the examples.
Embodiment One
The embodiment of the invention provides a voice synthesis playing method, which is suitable for a user terminal and comprises the following steps: receiving a voice synthesis request, and acquiring a text to be synthesized, which needs to be subjected to voice synthesis, according to the voice synthesis request; sending the text to be synthesized to a voice synthesis server for voice synthesis, so that the voice synthesis server returns the synthesized voice corresponding to the text to be synthesized; playing the synthesized voice and receiving a pronunciation correction request for the synthesized voice; receiving correction data corresponding to the synthesized voice according to the pronunciation correction request, and sending the correction data to the voice synthesis server, so that the voice synthesis server updates the synthesized voice according to the correction data and returns the updated synthesized voice; and replacing the currently played synthesized voice with the updated synthesized voice for playing.
Referring to fig. 2a, the process of the speech synthesis playing method may be as follows:
and 201, receiving a voice synthesis request, and acquiring a text to be synthesized, which needs to be subjected to voice synthesis, according to the voice synthesis request.
In the embodiment of the invention, the user terminal can receive an externally input voice synthesis request in real time, thereby triggering voice synthesis and converting the corresponding text into speech output. The user terminal can receive a voice synthesis request directly input by the user, or one input by another user terminal.
Illustratively, the user may input the speech synthesis request to the user terminal in a number of different ways.
For example, the user may speak "please read the current interface/full text" or the like in a manner of a voice instruction, so as to input a voice synthesis request for instructing to perform voice synthesis on the text in the current interface (e.g., a web browsing interface, a text browsing interface, or the like) or the full text to the user terminal.
For another example, the user terminal is provided with a speech synthesis control for inputting a speech synthesis request. As shown in fig. 2b, the speech synthesis control may take the form of a button labeled "reading", so that the user can click the control directly to input a speech synthesis request instructing the terminal to perform speech synthesis on the text in the current interface.
It should be noted that those skilled in the art can configure the user terminal according to actual needs, so that it can also receive voice synthesis requests input in other ways not shown above.
After receiving the voice synthesis request, the user terminal further obtains the text to be synthesized, which needs to be subjected to voice synthesis, according to the voice synthesis request.
In an embodiment, "obtaining a text to be synthesized that needs to be subjected to speech synthesis according to a speech synthesis request" includes:
(1) extracting a text in the display content of the foreground application according to the voice synthesis request to obtain an extracted text;
(2) dividing the extracted text into a plurality of clauses according to a preset clause strategy;
(3) and setting the clauses as texts to be synthesized.
In the embodiment of the present invention, the speech synthesis request is used to instruct the user terminal to perform speech synthesis on the text in the foreground application of the user terminal, where the foreground application is the application currently displayed to the user.
Correspondingly, when the user terminal acquires the text to be synthesized, which needs to be subjected to voice synthesis, according to the voice synthesis request, the user terminal firstly extracts the text in the display content applied by the foreground according to the received voice synthesis request, and records the extracted text as the extracted text.
For example, when the user terminal receives a voice synthesis request while a web page is being browsed through a browser application, it retrieves the DOM tree of the web page according to the request and extracts the text in the web page based on a text-density calculation method, such as the body of a news article or the content of a novel chapter;
for another example, when the user terminal receives a speech synthesis request while a local document (e.g., a text file in txt, word, or a similar format) is being browsed through a text reading application, it decodes the local document according to the request and parses out the plain-text content, e.g., content encoded in GB2312.
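A minimal sketch of such decoding, assuming GB2312 with a UTF-8 fallback; the fallback order and function name are assumptions, not part of the patent.

```python
def read_plain_text(raw_bytes):
    """Decode a local document's bytes, trying GB2312 first, then UTF-8."""
    for encoding in ("gb2312", "utf-8"):
        try:
            return raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("unsupported text encoding")
```

In practice, a real reader would also consult byte-order marks or charset metadata before guessing.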
And after the extracted text corresponding to the foreground application is extracted, the user terminal further divides the extracted text into a plurality of clauses according to a preset clause strategy. It should be noted that, in the embodiment of the present invention, the configuration of the preset clause policy is not specifically limited, and may be configured by a person of ordinary skill in the art according to actual needs, for example, in the embodiment of the present invention, the configured preset clause policy is to perform clause according to punctuation marks and lengths.
And for the plurality of clauses obtained by division, the user terminal sets each clause as a text to be synthesized in sequence so as to perform voice synthesis on each clause.
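The clause policy described above (splitting by punctuation marks and length) can be sketched as follows; the punctuation set and the maximum length are assumed values, since the patent leaves the policy configurable.

```python
import re


def split_into_clauses(text, max_len=50):
    """Split text after sentence-ending punctuation, then enforce a length cap."""
    # Split after Chinese/Western sentence-ending punctuation marks.
    parts = [p.strip() for p in re.split(r"(?<=[。！？!?.;；])", text) if p.strip()]
    clauses = []
    for part in parts:
        # Chunk any clause that exceeds the maximum length.
        clauses.extend(part[i:i + max_len] for i in range(0, len(part), max_len))
    return clauses
```

Each resulting clause is then sent for synthesis in turn, which also gives the terminal a natural unit to mark during playback.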
202, sending the text to be synthesized to a voice synthesis server for voice synthesis, so that the voice synthesis server returns the synthesized voice corresponding to the text to be synthesized.
After acquiring the text to be synthesized, the user terminal constructs a speech synthesis capability request carrying the text according to a preset data format, sends the capability request to the speech synthesis server, and instructs the server to perform speech synthesis.
Illustratively, the following is a data format schematic of a speech synthesis capability request:
Here, header denotes the request header of the speech synthesis capability request and carries the unique identifier of the user terminal; qua denotes the device and application information of the user terminal; user denotes the user information, where user_id is the unique identifier of the user, followed by the user's location information (longitude and latitude), the IP address of the user terminal, and the network type of the user terminal.
payload denotes the request content of the speech synthesis capability request; payload.speech_meta denotes the voice configuration information, where compress is the compression type, person the speaker, volume the volume, speed the speaking speed, and pitch the tone; payload.session_id denotes the session ID, payload.index the number of the requested speech segment, payload.speech_request the speech synthesis type, and payload.content the content to be synthesized, i.e., the field filled with the text to be synthesized.
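The request body described above might look like the following, expressed as a Python dict. This is a hypothetical reconstruction: every field name and sample value follows the prose description and is an assumption, not the exact on-wire format.

```python
def build_synthesis_request(text, session_id, index):
    """Assemble a speech synthesis capability request (illustrative fields)."""
    return {
        "header": {
            "qua": "device/app info",                # device and application info
            "user": {
                "user_id": "u-123",                  # unique identifier of the user
                "longitude": 0.0,
                "latitude": 0.0,
            },
            "ip": "203.0.113.1",                     # IP address of the terminal
            "network_type": "wifi",
        },
        "payload": {
            "speech_meta": {                         # voice configuration
                "compress": "mp3",
                "person": "default-speaker",
                "volume": 5,
                "speed": 5,
                "pitch": 5,
            },
            "session_id": session_id,
            "index": index,                          # requested speech segment number
            "content": text,                         # the text to be synthesized
        },
    }
```

The terminal would serialize such a structure (e.g., as JSON) and send it over the network to the speech synthesis server.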
On the other hand, when receiving a speech synthesis capability request from the user terminal, the speech synthesis server performs speech synthesis according to the request to obtain the synthesized speech corresponding to the text to be synthesized, and returns it to the user terminal.
For example, corresponding to the data format of the speech synthesis capability request shown above, the data format in which the speech synthesis server returns the synthesized speech is as follows:
Here, header denotes the message header, which carries the session and the session ID; payload denotes the message body, where payload.speech_finish indicates whether synthesis is finished and payload.speech_base64 carries the Base64 data of the synthesized speech.
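A hypothetical reconstruction of that response and how a terminal might parse it; the field names follow the prose description and are assumptions, and the audio bytes are fake.

```python
import base64


def parse_synthesis_response(response):
    """Extract the audio bytes and finish flag from a synthesis response."""
    audio = base64.b64decode(response["payload"]["speech_base64"])
    finished = response["payload"]["speech_finish"]
    return audio, finished


# Example response as the server might return it (illustrative values).
example = {
    "header": {"session_id": "s1"},
    "payload": {
        "speech_finish": True,
        "speech_base64": base64.b64encode(b"fake-audio").decode("ascii"),
    },
}
```

Base64 encoding lets the binary waveform travel inside a text-based message body, at the cost of roughly a third more bytes.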
203, playing the synthesized speech, and receiving a pronunciation correction request for the synthesized speech.
The user terminal plays the synthesized voice after receiving the synthesized voice returned by the voice synthesis server, and receives a pronunciation correction request for the synthesized voice in the process of playing the synthesized voice.
In an embodiment, in the process of playing the synthesized speech, the speech synthesis playing method provided in the embodiment of the present invention further includes:
and identifying the clauses corresponding to the synthesized voice according to a preset rule.
In the embodiment of the invention, the user terminal can establish a speech synthesis playing interface covering the foreground application after extracting the extracted text of the foreground application, and display the extracted text in the speech synthesis playing interface.
In the process of playing the synthesized voice, the user terminal marks the clause corresponding to the synthesized voice according to a preset rule, highlighting it so that it is displayed differently from the other clauses; the user can thus quickly locate the clause being played in the extracted text. It should be noted that the embodiment of the present invention does not specifically limit how the preset rule is configured; those skilled in the art may configure it according to actual needs.
For example, the preset rule may be configured to increase the display scale. As shown in fig. 2c, with the clause of the text to be synthesized set to "Wang Baole did not understand at the beginning", the clause is given a larger display scale than the other clauses so that it is clearly distinguished from them.
For another example, the preset rule may be configured to adjust the display color. As shown in fig. 2d, the clause of the text to be synthesized, "Wang Baole did not understand at the beginning", is given a display color different from that of the other clauses so that it stands out from them.
In one embodiment, "receiving a pronunciation correction request for synthesized speech" includes:
(1) displaying a pronunciation correction control in a preset range of a clause corresponding to the synthesized voice;
(2) receiving a pronunciation correction request for the synthesized speech based on the pronunciation correction control.
In the embodiment of the invention, in the process of playing the synthesized voice, the user terminal not only marks the clause corresponding to the synthesized voice according to the preset rule, but also displays the pronunciation correction control in the preset range of the clause corresponding to the synthesized voice, so as to receive the pronunciation correction request of the played synthesized voice through the pronunciation correction control. It should be noted that, for the configuration in the preset range, the embodiment of the present invention is not particularly limited, and the configuration may be performed by a person skilled in the art according to actual needs.
For example, as shown in fig. 2e, with the clause "Wang Baole did not understand at the beginning" set as the text to be synthesized, the user terminal marks the clause by changing its display color while playing the corresponding synthesized speech, and at the same time displays the pronunciation correction control at the end of the clause, so that the user can input a pronunciation correction request to the user terminal by clicking the control.
And 204, receiving correction data corresponding to the synthesized voice according to the pronunciation correction request, sending the correction data to the voice synthesis server, enabling the voice synthesis server to update the synthesized voice according to the correction data, and returning the updated synthesized voice.
In the embodiment of the present invention, after receiving an input pronunciation correction request, the user terminal further receives correction data corresponding to the synthesized voice according to the pronunciation correction request, and after receiving the correction data corresponding to the synthesized voice, the user terminal sends the correction data to the voice synthesis server, so that the voice synthesis server updates the synthesized voice according to the correction data, and returns the updated synthesized voice. The correction data includes the word to be corrected and the correct pronunciation.
For example, the user terminal may send the correction data by reusing the voice synthesis capability request. The difference from the data format of the voice synthesis capability request shown above is that two fields, "report_correction" and "correction_info", are added. report_correction indicates whether the voice synthesis capability request sent this time is used to update the synthesized voice: when the value is "true", an update is requested; when the value is "false", ordinary voice synthesis is performed. correction_info carries the correction data. When the voice synthesis server receives a voice synthesis capability request from the user terminal, it determines from report_correction whether the request is for updating the synthesized voice; if so, it extracts the correction data and the text to be synthesized from correction_info, re-synthesizes according to the correction data and the text to be synthesized to obtain new synthesized voice, and returns the new synthesized voice to the user terminal as the updated synthesized voice.
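The request/response decision described above can be sketched as follows. The field names report_correction and correction_info (the latter normalized from the translation) and the overall dictionary layout are assumptions for illustration only:

```python
def build_capability_request(text, report_correction=False, correction=None):
    """Build a voice synthesis capability request as described above.

    The field names "report_correction" and "correction_info" follow the
    description; the exact wire format is an assumption.
    """
    req = {"payload": {"text": text, "report_correction": report_correction}}
    if report_correction:
        # e.g. correction = {"word": "乐", "pronunciation": "le4"}
        req["payload"]["correction_info"] = correction
    return req

def handle_request(req, synthesize):
    """Server side: branch between ordinary synthesis and an update."""
    payload = req["payload"]
    if payload["report_correction"]:
        # re-synthesize with the correction data applied
        return synthesize(payload["text"], payload["correction_info"])
    return synthesize(payload["text"], None)

update = build_capability_request(
    "乐不", report_correction=True,
    correction={"word": "乐", "pronunciation": "le4"})
normal = build_capability_request("乐不")
```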
In one embodiment, "receiving correction data corresponding to a synthesized speech according to a pronunciation correction request," includes:
(1) displaying a pronunciation correction interface according to the pronunciation correction request, wherein the pronunciation correction interface comprises a character input control and a pronunciation control;
(2) receiving target words needing to be corrected in the text to be synthesized based on the word input control;
(3) receiving a target pronunciation corresponding to the target character based on the pronunciation control;
(4) the target word and the target utterance are set as correction data.
In the embodiment of the invention, when receiving the correction data corresponding to the synthesized voice according to the pronunciation correction request, the user terminal first displays a pronunciation correction interface according to the received pronunciation correction request, wherein the pronunciation correction interface comprises a character input control and a pronunciation control; the character input control is used for receiving the target character to be corrected in the text to be synthesized, and the pronunciation control is used for receiving the target pronunciation of the target character.
Therefore, the user terminal can receive the target character which needs to be corrected and is input by the user based on the character input control, receive the target pronunciation corresponding to the target character based on the pronunciation control, and set the target character and the corresponding target pronunciation as the correction data.
Before the target word and the corresponding target pronunciation are set as the correction data, the user terminal further verifies whether the target word belongs to the text to be synthesized corresponding to the synthesized voice, and sets the received target word and target pronunciation as the correction data for the played synthesized voice only when it does, thereby ensuring the accuracy of the correction of the synthesized voice.
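A minimal sketch of this validation step, with illustrative function and field names of my own choosing:

```python
def make_correction_data(target_word, target_pron, text_to_synthesize):
    """Accept a correction only if the target character actually occurs in
    the text to be synthesized; otherwise reject it (names illustrative)."""
    if target_word not in text_to_synthesize:
        return None  # not part of the clause being played: reject
    return {"word": target_word, "pronunciation": target_pron}

accepted = make_correction_data("乐", "le4", "开始时王百乐不明白")
rejected = make_correction_data("行", "xing2", "开始时王百乐不明白")
```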
In one embodiment, "receiving a target pronunciation for a corresponding target word based on a pronunciation control" includes:
(1) checking whether the target character is a polyphone character;
(2) when the target character is judged to be a polyphone character, acquiring a plurality of pronunciations corresponding to the target character according to a preset corresponding relation between the polyphone character and the pronunciations;
(3) displaying a plurality of pronunciations based on the pronunciation control, and receiving selection operation of the displayed pronunciations;
(4) and setting the pronunciation corresponding to the selection operation as the target pronunciation of the target character.
In the embodiment of the invention, in order to further ensure the accuracy of the correction of the synthesized voice, when the user terminal receives the target pronunciation of the corresponding target character based on the pronunciation control, whether the target character is a polyphone character is firstly checked, so that the error correction caused by the error input of the user is eliminated at the source.
For example, the user terminal is configured with a polyphone database in advance, which stores known polyphonic characters. When verifying whether a target character input by the user is polyphonic, the user terminal queries whether the target character exists in the polyphone database; if it does, the check passes and the target character is judged to be a polyphone.
And when the input target character is judged to be a polyphone character, the user terminal further acquires a plurality of pronunciations corresponding to the target character according to the preset corresponding relation between the polyphone character and the pronunciations. Then, the user terminal displays the acquired multiple pronunciations corresponding to the target characters based on the pronunciation control, receives selection operation on the displayed pronunciations, and sets the pronunciations corresponding to the selection operation as target pronunciations of the target characters.
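The lookup described above can be sketched as follows; the tiny in-memory dictionary stands in for the pre-configured polyphone database, and its contents here are only examples:

```python
# Stand-in for the pre-configured polyphone database; a real system would
# ship a far larger character-to-readings dictionary.
POLYPHONES = {
    "乐": ["le4", "yue4"],   # e.g. 快乐 (le4) vs 音乐 (yue4)
    "行": ["xing2", "hang2"],
}

def candidate_pronunciations(char):
    """Return all known readings if `char` is a polyphone, else None."""
    return POLYPHONES.get(char)
```

The pronunciation controls are then populated with the returned readings, one control per reading, matching the behavior described for fig. 2f.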
Illustratively, referring to fig. 2f, the pronunciation correction interface shows:
the clause set as the text to be synthesized, "Wang Baole did not understand at the beginning";
a character input control in the form of an input box, together with first prompt information "please input the character to be corrected", prompting the user to input the target character that needs correction; in the figure, the user inputs "乐";
pronunciation controls in the form of selection boxes, together with second prompt information "please select the correct pronunciation", prompting the user to select the correct pronunciation as the target pronunciation, where the number of pronunciation controls equals the number of pronunciations obtained for the target character, e.g. one control showing the pronunciation "le4" and another showing "yue4";
and a report-correction control used to indicate that input is finished; when input is finished, the user can click it, whereupon the user terminal obtains the target character "乐" and the corresponding target pronunciation "le4".
And 205, replacing the currently played synthesized voice with the updated synthesized voice for playing.
When receiving the updated synthesized voice returned by the voice synthesis server, the user terminal replaces the currently played synthesized voice with the updated synthesized voice to play, so as to realize pronunciation correction of the synthesized voice.
In an embodiment, the method for speech synthesis and playing provided by the embodiment of the present invention further includes:
storing the text to be synthesized, the synthesized speech, and/or the updated synthesized speech in the distributed system.
Taking a blockchain system as an example of the distributed system, please refer to fig. 2g, which is an optional structural schematic diagram of the distributed system 100 applied to a blockchain according to an embodiment of the present invention. The system is formed by a plurality of nodes (the user terminal, the other user terminals, and the speech synthesis server mentioned in the above embodiments) and clients; a Peer-to-Peer (P2P) network is formed between the nodes, and the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP). A node comprises a hardware layer, a middle layer, an operating system layer, and an application layer.
Referring to the functions of each node in the blockchain system shown in fig. 2g, the functions involved include:
1) routing, a basic function that a node has, is used to support communication between nodes.
Besides the routing function, the node may also have the following functions:
2) the application is used for being deployed in a block chain, realizing specific services according to actual service requirements, recording data related to the realization functions to form recording data, carrying a digital signature in the recording data to represent a source of task data, and sending the recording data to other nodes in the block chain system, so that the other nodes add the recording data to a temporary block when the source and integrity of the recording data are verified successfully.
For example, the services implemented by the application include:
2.1) wallet, for providing functions of electronic money transactions, including initiating a transaction (i.e. sending the transaction record of the current transaction to other nodes in the blockchain system; after the other nodes verify it successfully, the record data of the transaction is stored in a temporary block of the blockchain as an acknowledgement that the transaction is valid); of course, the wallet also supports querying the electronic money remaining at an electronic money address;
2.2) shared ledger, for providing functions such as storage, query, and modification of account data; record data of the operations on the account data is sent to other nodes in the blockchain system, and after the other nodes verify its validity, the record data is stored in a temporary block as an acknowledgement that the account data is valid, and a confirmation may be sent to the node that initiated the operation.
2.3) smart contracts, computerized agreements that can enforce the terms of a contract, implemented as code deployed on the shared ledger and executed when certain conditions are met; they are used to complete automated transactions according to actual business requirements, e.g. querying the logistics status of goods purchased by a buyer, and transferring the buyer's electronic money to the merchant's address after the buyer signs for the goods. Of course, smart contracts are not limited to contracts for executing trades; they may also execute contracts that process received information.
3) blockchain, which comprises a series of blocks (Blocks) connected to one another in chronological order of generation; new blocks cannot be removed once added to the blockchain, and the blocks record data submitted by nodes in the blockchain system.
Referring to fig. 2h, fig. 2h is an optional schematic diagram of a Block Structure (Block Structure) according to an embodiment of the present invention, where each Block includes a hash value of a transaction record stored in the Block (hash value of the Block) and a hash value of a previous Block, and the blocks are connected by the hash value to form a Block chain. The block may include information such as a time stamp at the time of block generation. A block chain (Blockchain), which is essentially a decentralized database, is a string of data blocks associated by using cryptography, and each data block contains related information for verifying the validity (anti-counterfeiting) of the information and generating a next block.
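The hash-linked block structure of fig. 2h can be sketched as follows. This is a toy illustration of the linking mechanism only, not the patent's actual storage format; field names are my own:

```python
import hashlib
import json
import time

def make_block(records, prev_hash):
    """Create a block whose hash covers its records and the previous
    block's hash, as in the structure of fig. 2h."""
    body = {"records": records, "prev_hash": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return dict(body, timestamp=time.time(), hash=digest)

def verify_chain(chain):
    """Valid iff each block references the hash of its predecessor."""
    return all(chain[i]["prev_hash"] == chain[i - 1]["hash"]
               for i in range(1, len(chain)))

genesis = make_block(["text: 乐不"], prev_hash="0" * 64)
second = make_block(["speech: le4 bu4"], prev_hash=genesis["hash"])
chain = [genesis, second]
```

Because each block's hash is referenced by its successor, tampering with any stored record breaks the chain downstream, which is what makes appended records effectively irremovable.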
In the embodiment of the present invention, the user terminal may further store the text to be synthesized and the synthesized speech corresponding thereto and/or the updated synthesized speech in the speech synthesis process to the distributed system where the user terminal is located, so as to record the text and the synthesized speech.
As can be seen from the above, in the embodiment of the present invention, the user terminal may receive a voice synthesis request and obtain, according to the request, the text to be synthesized that requires voice synthesis; send the text to be synthesized to the voice synthesis server for voice synthesis and obtain the corresponding synthesized voice; play the synthesized voice and receive a pronunciation correction request for it; receive correction data corresponding to the synthesized voice according to the pronunciation correction request and send the correction data to the voice synthesis server for updating the synthesized voice, thereby obtaining an updated synthesized voice; and replace the currently played synthesized voice with the updated synthesized voice for playing. Compared with the prior art, the method and the device can correct and update the played synthesized voice in real time in the process of playing it, so that even if the pronunciation prediction of a polyphone is wrong, the pronunciation can be corrected in time.
Example II,
The embodiment of the invention also provides a voice synthesis playing method, which is suitable for a voice synthesis server and comprises the following steps: when a text to be synthesized from a user terminal is received, carrying out voice synthesis on the text to be synthesized according to a pre-trained voice synthesis model to obtain synthesized voice; the synthesized voice is returned to the user terminal for playing, and correction data corresponding to the synthesized voice returned by the user terminal is received; updating the synthesized voice according to the correction data to obtain the updated synthesized voice; and returning the updated synthesized voice to the user terminal, so that the user terminal replaces the synthesized voice with the updated synthesized voice to play.
Referring to fig. 3, the flow of the speech synthesis playing method may be as follows:
301, when receiving the text to be synthesized from the user terminal, performing speech synthesis on the text to be synthesized according to the pre-trained speech synthesis model to obtain the synthesized speech.
When receiving an input voice synthesis request, the user terminal acquires a text to be synthesized, which needs to be subjected to voice synthesis, according to the voice synthesis request. Correspondingly, the voice synthesis server receives the text to be synthesized from the user terminal, and when the text to be synthesized from the user terminal is received, the voice synthesis server performs voice synthesis on the text to be synthesized according to the pre-trained voice synthesis model to obtain the synthesized voice.
It should be noted that the speech synthesis Model may be obtained by pre-training using a Conditional Random Field (CRF) method, a Hidden Markov Model (HMM) method, a decision tree method, and the like, which is not described herein again.
302, returning the synthesized voice to the user terminal for playing, and receiving the correction data corresponding to the synthesized voice returned by the user terminal.
And after the voice synthesis server obtains the synthesized voice through synthesis, the synthesized voice is returned to the user terminal for playing.
On the other hand, the user terminal receives a pronunciation correction request for the synthesized voice during the process of playing the synthesized voice, receives the input correction data according to the pronunciation correction request, and transmits the correction data to the voice synthesis server. Correspondingly, the voice synthesis server also receives correction data corresponding to the text to be synthesized, which is returned by the user terminal.
303, the synthesized speech is updated based on the corrected data to obtain an updated synthesized speech.
After receiving the correction data returned by the user terminal, the speech synthesis server re-synthesizes the previously synthesized speech according to the correction data, obtaining the updated synthesized speech.
For example, with the text to be synthesized being "Wang Baole did not understand at the beginning", the pronunciation of "乐" in the synthesized speech is "yue4", while in the updated synthesized speech the pronunciation of "乐" is "le4".
And 304, returning the updated synthesized voice to the user terminal, so that the user terminal replaces the synthesized voice with the updated synthesized voice to play.
After the updated synthesized voice is obtained, the voice synthesis server returns it to the user terminal, so that the user terminal replaces the currently played synthesized voice with the updated synthesized voice for playing, thereby realizing pronunciation correction.
It should be noted that, for the specific description of the embodiment of the present invention, reference may be made to the related description in the above embodiment of the speech synthesis playing method applied to the user terminal, and details are not described here again.
In an embodiment, the method for speech synthesis and playing provided by the embodiment of the present invention further includes:
and updating the voice synthesis model according to the text to be synthesized and the correction data.
For example, after receiving the correction data from the user terminal each time, the speech synthesis server stores the correction data and the text to be synthesized as the training corpus in a pre-created database, enriches the training corpus continuously, and updates the speech synthesis model by adopting a supervised learning manner according to the stored training corpus when the training corpus in the database is accumulated to a preset number, so that the speech synthesis model can predict the pronunciation of the polyphone more accurately.
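The corpus-accumulation step can be sketched as below. The in-memory list stands in for the pre-created database, and the retrain threshold is an assumed value; actual retraining of the model is out of scope here:

```python
class CorrectionCorpus:
    """Accumulate (text, correction) pairs as training corpus and signal
    when enough have been collected to retrain the synthesis model.
    The threshold and in-memory storage are illustrative stand-ins."""

    def __init__(self, retrain_threshold=1000):
        self.samples = []
        self.retrain_threshold = retrain_threshold

    def add(self, text, correction):
        self.samples.append((text, correction))
        # True once enough corpus has accumulated to trigger retraining
        return len(self.samples) >= self.retrain_threshold

corpus = CorrectionCorpus(retrain_threshold=2)
first = corpus.add("乐不", {"word": "乐", "pronunciation": "le4"})
second = corpus.add("行不", {"word": "行", "pronunciation": "xing2"})
```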
In an embodiment, the speech synthesis method provided in the embodiment of the present invention further includes:
storing the text to be synthesized, the synthesized speech, and/or the updated synthesized speech in the distributed system.
Taking the distributed system as an example of a blockchain system, the blockchain system is formed by a plurality of nodes (the user terminal and the voice synthesis server mentioned in the above embodiments of the present invention, etc.) and a client.
In the embodiment of the invention, the voice synthesis server can also store the text to be synthesized and the corresponding synthesized voice and/or the updated synthesized voice in the voice synthesis process to the distributed system where the text and the corresponding synthesized voice are/is located so as to record the text and the corresponding synthesized voice.
Example III,
The method according to the preceding example is further illustrated below by way of example.
As shown in fig. 4, the flow of the speech synthesis playing method may be as follows:
401, the user terminal receives the voice synthesis request, extracts a text in the display content of the foreground application according to the voice synthesis request to obtain an extracted text, divides the extracted text into a plurality of clauses according to a preset clause strategy, sets the clauses obtained by division as texts to be synthesized in sequence, and sends the texts to be synthesized to the voice synthesis server.
In the embodiment of the invention, the user terminal can receive the voice synthesis request input from the outside in real time, thereby triggering the voice synthesis and converting the corresponding text into the voice for outputting.
After receiving the voice synthesis request, the user terminal further obtains, according to the request, the text to be synthesized that requires voice synthesis. First, the user terminal extracts the text in the display content of the foreground application according to the received voice synthesis request, and records it as the extracted text.
For example, when the user terminal receives a voice synthesis request while the user is browsing a web page through a browser application, the user terminal obtains the DOM tree of the web page according to the voice synthesis request and extracts the text in the web page, such as the body of a news article or the content of a novel chapter, based on a text-density calculation method;
for another example, when the user terminal receives a speech synthesis request while the user is browsing a local document (e.g. a text file in a format such as txt or word) through a text reading application, the local document is decoded according to the speech synthesis request, and plain text content encoded in GB2312 is parsed out.
After extracting the text corresponding to the foreground application, the user terminal further divides the extracted text into a plurality of clauses according to a preset clause strategy. It should be noted that the embodiment of the present invention does not specifically limit the configuration of the preset clause strategy, which may be configured by a person of ordinary skill in the art according to actual needs; for example, in the embodiment of the present invention, the configured preset clause strategy is to divide clauses according to punctuation marks and length.
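A minimal sketch of a punctuation-and-length clause strategy, assuming a maximum clause length (the value 50 is an assumption, as is the exact punctuation set):

```python
import re

def split_clauses(text, max_len=50):
    """Split extracted text into clauses at sentence-ending punctuation,
    then cap each clause's length. Both rules follow the preset strategy
    described above; max_len and the punctuation set are assumed values."""
    parts = [p for p in re.split(r"(?<=[。！？!?；;])", text) if p]
    clauses = []
    for p in parts:
        while len(p) > max_len:   # overlong clause: fall back to length split
            clauses.append(p[:max_len])
            p = p[max_len:]
        clauses.append(p)
    return clauses

clauses = split_clauses("你好。世界！")
```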
And for the plurality of clauses obtained by division, the user terminal sets each clause as a text to be synthesized in sequence so as to perform voice synthesis on each clause.
After acquiring the text to be synthesized that requires voice synthesis, the user terminal constructs a voice synthesis capability request carrying the text to be synthesized according to a preset data format, sends the voice synthesis capability request to the voice synthesis server, and instructs the voice synthesis server to perform voice synthesis.
And 402, performing voice synthesis on the text to be synthesized by the voice synthesis server according to the pre-trained voice synthesis model to obtain synthesized voice, and returning the synthesized voice to the user terminal.
It should be noted that the speech synthesis Model may be obtained by pre-training using a Conditional Random Field (CRF) method, a Hidden Markov Model (HMM) method, a decision tree method, and the like, which is not described herein again.
The voice synthesis server receives a text to be synthesized from the user terminal, and when the text to be synthesized from the user terminal is received, carries out voice synthesis on the text to be synthesized according to a pre-trained voice synthesis model to obtain synthesized voice, and returns the synthesized voice to the user terminal.
And 403, playing the synthesized voice by the user terminal, marking the clause corresponding to the synthesized voice according to a preset rule, and displaying the pronunciation correction control in a preset range of the clause.
The user terminal receives 404 a pronunciation correction request for the synthesized speech based on the pronunciation correction control.
And after receiving the synthesized voice returned by the voice synthesis server, the user terminal plays the synthesized voice, and marks clauses corresponding to the synthesized voice according to a preset rule in the process of playing the synthesized voice.
The clause corresponding to the synthesized voice is marked so that it is highlighted and displayed differently from the other clauses, allowing the user to quickly locate the clause being played within the extracted text. It should be noted that the embodiment of the present invention does not specifically limit how the preset rule is configured; a person skilled in the art may configure it according to actual needs.
Besides identifying the clauses corresponding to the synthesized voice according to the preset rules, the user terminal also displays a pronunciation correction control in the preset range of the clauses corresponding to the synthesized voice, so that a pronunciation correction request for the played synthesized voice is received through the pronunciation correction control. It should be noted that, for the configuration in the preset range, the embodiment of the present invention is not particularly limited, and the configuration may be performed by a person skilled in the art according to actual needs.
405, the user terminal displays a pronunciation correction interface according to the pronunciation correction request, wherein the pronunciation correction interface comprises a character input control and a pronunciation control;
and 406, the user terminal receives a target word to be corrected in the text to be synthesized based on the word input control, receives a target pronunciation corresponding to the target word based on the pronunciation control, sets the target word and the target pronunciation as correction data, and sends the correction data to the voice synthesis server.
In the embodiment of the invention, when receiving the correction data corresponding to the synthesized voice according to the pronunciation correction request, the user terminal first displays a pronunciation correction interface according to the received pronunciation correction request, wherein the pronunciation correction interface comprises a character input control and a pronunciation control; the character input control is used for receiving the target character to be corrected in the text to be synthesized, and the pronunciation control is used for receiving the target pronunciation of the target character.
For example, referring to fig. 2f, the pronunciation correction interface shows:
the clause set as the text to be synthesized, "Wang Baole did not understand at the beginning";
a character input control in the form of an input box, together with first prompt information "please input the character to be corrected", prompting the user to input the target character that needs correction; in the figure, the user inputs "乐";
pronunciation controls in the form of selection boxes, together with second prompt information "please select the correct pronunciation", prompting the user to select the correct pronunciation as the target pronunciation, where the number of pronunciation controls equals the number of pronunciations obtained for the target character, e.g. one control showing the pronunciation "le4" and another showing "yue4";
and a report-correction control used to indicate that input is finished; when input is finished, the user can click it, whereupon the user terminal obtains the target character "乐" and the corresponding target pronunciation "le4".
The user terminal receives a target word which is input by a user and needs to be corrected based on the word input control, and sets the target word and a target pronunciation corresponding to the target word as correction data after receiving the target pronunciation corresponding to the target word based on the pronunciation control, and sends the correction data to the voice synthesis server.
And 407, the voice synthesis server updates the synthesized voice according to the correction data to obtain the updated synthesized voice, and returns the synthesized voice to the user terminal.
After receiving the correction data returned by the user terminal, the voice synthesis server re-synthesizes the previously synthesized speech according to the correction data to obtain the updated synthesized voice, and returns it to the user terminal.
For example, with the text to be synthesized being "Wang Baole did not understand at the beginning", the pronunciation of "乐" in the synthesized speech is "yue4", while in the updated synthesized speech the pronunciation of "乐" is "le4".
And 408, the user terminal replaces the currently played synthesized voice with the updated synthesized voice for playing.
When receiving the updated synthesized voice returned by the voice synthesis server, the user terminal replaces the currently played synthesized voice with the updated synthesized voice to play, so as to realize pronunciation correction of the synthesized voice.
Example four,
In order to better implement the above speech synthesis playing method, an embodiment of the present invention further provides a speech synthesis playing device, and the speech synthesis playing device may be specifically integrated in a user terminal.
For example, as shown in fig. 5, the speech synthesis playing apparatus may include a text acquisition module 501, a speech synthesis module 502, a speech playing module 503, and a text correction module 504, as follows:
the text obtaining module 501 is configured to receive a speech synthesis request, and obtain a text to be synthesized, which needs to be subjected to speech synthesis, according to the speech synthesis request.
The speech synthesis module 502 is configured to send the text to be synthesized to the speech synthesis server for speech synthesis, so that the speech synthesis server returns a synthesized speech corresponding to the text to be synthesized;
a voice playing module 503, configured to play the synthesized voice and receive a pronunciation correction request for the synthesized voice;
a text correction module 504, configured to receive correction data corresponding to the synthesized speech according to the pronunciation correction request, and send the correction data to the speech synthesis server, so that the speech synthesis server updates the synthesized speech according to the correction data, and returns the updated synthesized speech;
the voice playing module 503 is further configured to replace the currently played synthesized voice with the updated synthesized voice for playing.
In one embodiment, upon receiving correction data corresponding to the synthesized speech based on the pronunciation correction request, the text correction module 504 is configured to:
displaying a pronunciation correction interface according to the pronunciation correction request, wherein the pronunciation correction interface comprises a character input control and a pronunciation control;
receiving target words needing to be corrected in the text to be synthesized based on the word input control;
receiving a target pronunciation corresponding to the target character based on the pronunciation control;
the target word and the target utterance are set as correction data.
In one embodiment, when receiving the target pronunciation corresponding to the target word based on the pronunciation control, the text correction module 504 is configured to:
checking whether the target character is a polyphone character;
when the target character is judged to be a polyphone character, acquiring a plurality of pronunciations corresponding to the target character according to a preset corresponding relation between the polyphone character and the pronunciations;
displaying a plurality of pronunciations based on the pronunciation control, and receiving selection operation of the displayed pronunciations;
and setting the pronunciation corresponding to the selection operation as the target pronunciation of the target character.
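The polyphone check and candidate listing described in these steps can be sketched with a preset character-to-readings table. The table contents and function names below are illustrative assumptions, not the patent's preset correspondence:

```python
# Illustrative polyphone table; entries and names are assumptions.
POLYPHONES = {
    "乐": ["yue4", "le4"],
    "行": ["xing2", "hang2"],
    "长": ["chang2", "zhang3"],
}

def candidate_pronunciations(char):
    """Return all candidate readings for a polyphonic character, else []."""
    return POLYPHONES.get(char, [])

def is_polyphone(char):
    # A character is treated as polyphonic if it has more than one reading.
    return len(candidate_pronunciations(char)) > 1

print(is_polyphone("乐"), candidate_pronunciations("乐"))  # True ['yue4', 'le4']
```

The pronunciation control would display the returned candidates, and the user's selection becomes the target pronunciation.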
In an embodiment, when acquiring a text to be synthesized that needs to be subjected to speech synthesis according to a speech synthesis request, the text acquisition module 501 is configured to:
extracting a text in the display content of the foreground application according to the voice synthesis request to obtain an extracted text;
dividing the extracted text into a plurality of clauses according to a preset clause strategy;
and setting the clauses as texts to be synthesized.
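One plausible "preset clause strategy" is splitting the extracted text on sentence-ending punctuation. The patent does not specify the actual strategy, so the rule below is only an assumption for illustration:

```python
import re

# Assumed clause strategy: split after Chinese and Western sentence-ending
# punctuation, then drop empty fragments.
def split_clauses(text):
    parts = re.split(r"(?<=[。！？!?；;])", text)
    return [p.strip() for p in parts if p.strip()]

print(split_clauses("今天天气很好。我们出去玩！好吗？"))
# ['今天天气很好。', '我们出去玩！', '好吗？']
```

Each resulting clause is then submitted for synthesis, which also gives the playback interface a natural unit at which to attach the pronunciation correction control.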
In one embodiment, in playing the synthesized speech, the speech playing module 503 is further configured to:
and identifying the clauses corresponding to the synthesized voice according to a preset rule.
In one embodiment, upon receiving a pronunciation correction request for the synthesized speech, the speech playing module 503 is configured to:
displaying a pronunciation correction control in a preset range of a clause corresponding to the synthesized voice;
an articulation correction request for the synthesized speech is received based on the articulation correction control.
In an embodiment, the speech synthesis playing apparatus provided in the embodiment of the present invention further includes a data storage module, configured to:
storing the text to be synthesized, the synthesized speech, and/or the updated synthesized speech in the distributed system.
Example Five
In order to better implement the above speech synthesis playing method, an embodiment of the present invention further provides a speech synthesis playing device, which may be specifically integrated in a speech synthesis server.
For example, as shown in fig. 6, the speech synthesis playing apparatus may include a speech synthesis module 601, a speech sending module 602, and a speech updating module 603, as follows:
the speech synthesis module 601 is configured to, when receiving a text to be synthesized from a user terminal, perform speech synthesis on the text to be synthesized according to a pre-trained speech synthesis model to obtain a synthesized speech;
the voice issuing module 602 is configured to return the synthesized voice to the user terminal for playing, and receive correction data corresponding to the synthesized voice returned by the user terminal;
a voice updating module 603, configured to update the synthesized voice according to the correction data to obtain an updated synthesized voice;
the voice issuing module 602 is further configured to return the updated synthesized voice to the user terminal, so that the user terminal replaces the synthesized voice with the updated synthesized voice to play.
In an embodiment, the speech synthesis playing apparatus provided in the embodiment of the present invention further includes a model updating module, configured to:
and updating the voice synthesis model according to the text to be synthesized and the correction data.
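One way the model-update step could be organized is to accumulate correction records as (text, target word, corrected pronunciation) triples that later serve as supervision for retraining the front-end lexicon or the synthesis model. The record format and function names below are hypothetical, not taken from the patent:

```python
# Hedged sketch: correction records accumulated for later model updates.
corrections_log = []

def record_correction(text, word, pronunciation):
    corrections_log.append({"text": text, "word": word, "pron": pronunciation})

def build_lexicon(log):
    # Collapse the log into a word -> latest corrected pronunciation map.
    lexicon = {}
    for entry in log:
        lexicon[entry["word"]] = entry["pron"]
    return lexicon

record_correction("sample text", "乐", "le4")
print(build_lexicon(corrections_log))  # {'乐': 'le4'}
```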
In an embodiment, the speech synthesis playing apparatus provided in the embodiment of the present invention further includes a data storage module, configured to:
storing the text to be synthesized, the synthesized speech, and/or the updated synthesized speech in the distributed system.
Example Six
The embodiment of the invention also provides a user terminal, which may be a mobile phone, a tablet computer, a notebook computer, or other device. Fig. 7 shows a schematic structural diagram of a user terminal according to an embodiment of the present invention. Specifically:
the user terminal may include components such as a processor 701 with one or more processing cores, a memory 702 of one or more computer-readable storage media, a power supply 703, and an input unit 704. Those skilled in the art will appreciate that the user terminal structure shown in fig. 7 does not constitute a limitation of the user terminal, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. Wherein:
the processor 701 is a control center of the user terminal, connects various parts of the entire user terminal using various interfaces and lines, and performs various functions of the user terminal and processes data by operating or executing software programs and/or modules stored in the memory 702 and calling data stored in the memory 702.
The memory 702 may be used to store software programs and modules, and the processor 701 executes various functional applications and data processing by running the software programs and modules stored in the memory 702. Further, the memory 702 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 702 may also include a memory controller to provide the processor 701 with access to the memory 702.
The user terminal further includes a power source 703 for supplying power to each component, and preferably, the power source 703 may be logically connected to the processor 701 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system.
The user terminal may further include an input unit 704, which may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
Although not shown, the user terminal may further include a display unit and the like, which will not be described in detail herein. Specifically, in this embodiment, the processor 701 in the user terminal loads the executable file corresponding to the process of one or more application programs into the memory 702 according to the following instructions, and the processor 701 runs the application program stored in the memory 702, thereby implementing various functions as follows:
receiving a voice synthesis request, and acquiring a text to be synthesized, which needs to be subjected to voice synthesis, according to the voice synthesis request; sending the text to be synthesized to a voice synthesis server for voice synthesis, so that the voice synthesis server returns the synthesized voice corresponding to the text to be synthesized; playing the synthesized voice and receiving a pronunciation correction request for the synthesized voice; receiving correction data corresponding to the synthesized voice according to the pronunciation correction request, and sending the correction data to the voice synthesis server, so that the voice synthesis server updates the synthesized voice according to the correction data and returns the updated synthesized voice; and replacing the currently played synthesized voice with the updated synthesized voice for playing.
It should be noted that the user terminal provided in the embodiment of the present invention and the speech synthesis playing method applied to the user terminal in the above embodiment belong to the same concept, and the specific implementation process thereof is described in the above embodiment of the method, and is not described herein again.
Example Seven
An embodiment of the present invention further provides a speech synthesis server. Fig. 8 shows a schematic structural diagram of a speech synthesis server according to an embodiment of the present invention. Specifically:
the speech synthesis server may include components such as a processor 801 of one or more processing cores, memory 802 of one or more computer-readable storage media, a power supply 803, and an input unit 804. Those skilled in the art will appreciate that the speech synthesis server architecture shown in FIG. 8 is not intended to be limiting of speech synthesis servers and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 801 is a control center of the speech synthesis server, connects various parts of the entire speech synthesis server using various interfaces and lines, and performs various functions of the speech synthesis server and processes data by running or executing software programs and/or modules stored in the memory 802 and calling data stored in the memory 802.
The memory 802 may be used to store software programs and modules, and the processor 801 executes various functional applications and data processing by running the software programs and modules stored in the memory 802. Further, the memory 802 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 802 may also include a memory controller to provide the processor 801 with access to the memory 802.
The speech synthesis server further comprises a power supply 803 for supplying power to the various components, preferably, the power supply 803 may be logically connected to the processor 801 via a power management system, so that functions of managing charging, discharging, and power consumption are performed via the power management system.
Specifically, in this embodiment, the processor 801 in the speech synthesis server loads the executable file corresponding to the process of one or more application programs into the memory 802 according to the following instructions, and the processor 801 runs the application programs stored in the memory 802, thereby implementing various functions as follows:
when a text to be synthesized from a user terminal is received, carrying out voice synthesis on the text to be synthesized according to a pre-trained voice synthesis model to obtain synthesized voice; the synthesized voice is returned to the user terminal for playing, and correction data corresponding to the synthesized voice returned by the user terminal is received; updating the synthesized voice according to the correction data to obtain the updated synthesized voice; and returning the updated synthesized voice to the user terminal, so that the user terminal replaces the synthesized voice with the updated synthesized voice to play.
It should be noted that the speech synthesis server provided in the embodiment of the present invention and the speech synthesis playing method applied to the speech synthesis server in the above embodiment belong to the same concept, and the specific implementation process thereof is described in the above embodiment of the method, and is not described herein again.
Example Eight
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present invention provides a storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor of a user terminal to execute a speech synthesis playing method applicable to the user terminal, where the instructions may perform the following steps, for example:
receiving a voice synthesis request, and acquiring a text to be synthesized, which needs to be subjected to voice synthesis, according to the voice synthesis request; sending the text to be synthesized to a voice synthesis server for voice synthesis, so that the voice synthesis server returns the synthesized voice corresponding to the text to be synthesized; playing the synthesized voice and receiving a pronunciation correction request for the synthesized voice; receiving correction data corresponding to the synthesized voice according to the pronunciation correction request, and sending the correction data to the voice synthesis server, so that the voice synthesis server updates the synthesized voice according to the correction data and returns the updated synthesized voice; and replacing the currently played synthesized voice with the updated synthesized voice for playing.
Furthermore, an embodiment of the present invention provides a storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor of a speech synthesis server to execute the speech synthesis playing method applicable to the server provided by the embodiment of the present invention, for example, the instructions may perform the following steps:
when a text to be synthesized from a user terminal is received, carrying out voice synthesis on the text to be synthesized according to a pre-trained voice synthesis model to obtain synthesized voice; the synthesized voice is returned to the user terminal for playing, and correction data corresponding to the synthesized voice returned by the user terminal is received; updating the synthesized voice according to the correction data to obtain the updated synthesized voice; and returning the updated synthesized voice to the user terminal, so that the user terminal replaces the synthesized voice with the updated synthesized voice to play.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
The storage medium provided in the embodiment of the present invention can achieve the beneficial effects that can be achieved by the corresponding speech synthesis playing method provided in the embodiment of the present invention, which are detailed in the foregoing embodiments and will not be described herein again.
The speech synthesis playing method, device, and storage medium provided by the embodiments of the present invention are described in detail above. Specific examples are used herein to explain the principles and implementation of the present invention, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and application scope. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method for synthesizing and playing speech, comprising:
receiving a voice synthesis request, and acquiring a text to be synthesized, which needs to be subjected to voice synthesis, according to the voice synthesis request;
sending the text to be synthesized to a voice synthesis server for voice synthesis, so that the voice synthesis server returns a synthesized voice corresponding to the text to be synthesized;
playing the synthesized voice and receiving a pronunciation correction request for the synthesized voice;
receiving input correction data corresponding to the synthesized voice according to the pronunciation correction request, and sending the correction data to the voice synthesis server, so that the voice synthesis server updates the synthesized voice according to the correction data and returns the updated synthesized voice;
and replacing the currently played synthesized voice with the updated synthesized voice for playing.
2. The speech synthesis playing method according to claim 1, wherein the step of receiving correction data corresponding to the synthesized speech input according to the utterance correction request includes:
displaying a pronunciation correction interface according to the pronunciation correction request, wherein the pronunciation correction interface comprises a character input control and a pronunciation control;
receiving target words needing to be corrected in the text to be synthesized based on the word input control;
receiving a target pronunciation corresponding to the target character based on the pronunciation control;
setting the target word and the target pronunciation as the correction data.
3. The speech synthesis playing method according to claim 2, wherein the step of receiving a target pronunciation corresponding to the target word based on the pronunciation control comprises:
checking whether the target character is a polyphone character;
when the target character is judged to be a polyphone character, acquiring a plurality of pronunciations corresponding to the target character according to a preset corresponding relation between the polyphone character and the pronunciations;
displaying the plurality of pronunciations based on the pronunciation control, and receiving selection operation of the displayed pronunciations;
and setting the pronunciation corresponding to the selection operation as the target pronunciation of the target character.
4. The speech synthesis playing method according to any one of claims 1 to 3, wherein the step of obtaining the text to be synthesized that needs to be subjected to speech synthesis according to the speech synthesis request comprises:
extracting a text in the display content of the foreground application according to the voice synthesis request to obtain an extracted text;
dividing the extracted text into a plurality of clauses according to a preset clause strategy;
and setting the clauses as the texts to be synthesized.
5. The speech synthesis playing method according to any one of claims 1 to 3, wherein the speech synthesis playing method further comprises:
and storing the text to be synthesized, the synthesized voice and/or the updated synthesized voice into a distributed system.
6. A method for synthesizing and playing speech, comprising:
when a text to be synthesized from a user terminal is received, carrying out voice synthesis on the text to be synthesized according to a pre-trained voice synthesis model to obtain synthesized voice;
returning the synthesized voice to the user terminal for playing, and receiving correction data corresponding to the synthesized voice returned by the user terminal;
updating the synthesized voice according to the correction data to obtain updated synthesized voice;
and returning the updated synthesized voice to the user terminal, so that the user terminal replaces the synthesized voice with the updated synthesized voice for playing.
7. The speech synthesis playing method according to claim 6, further comprising:
and updating the voice synthesis model according to the text to be synthesized and the correction data.
8. A speech synthesis playback apparatus, comprising:
the text acquisition module is used for receiving a voice synthesis request and acquiring a text to be synthesized, which needs voice synthesis, according to the voice synthesis request;
the voice synthesis module is used for sending the text to be synthesized to a voice synthesis server for voice synthesis, so that the voice synthesis server returns the synthesized voice corresponding to the text to be synthesized;
the voice playing module is used for playing the synthesized voice and receiving a pronunciation correction request of the synthesized voice;
the text correction module is used for receiving input correction data corresponding to the synthesized voice according to the pronunciation correction request and sending the correction data to the voice synthesis server, so that the voice synthesis server updates the synthesized voice according to the correction data and returns the updated synthesized voice;
the voice playing module is further configured to replace the currently played synthesized voice with the updated synthesized voice for playing.
9. A speech synthesis playing device is characterized in that the device comprises a speech synthesis module, a speech sending module and a speech updating module, wherein,
the voice synthesis module is used for carrying out voice synthesis on the text to be synthesized according to a pre-trained voice synthesis model when receiving the text to be synthesized from the user terminal to obtain synthesized voice;
the voice issuing module is used for returning the synthesized voice to the user terminal for playing, and receiving the correction data corresponding to the synthesized voice returned by the user terminal;
the voice updating module is used for updating the synthesized voice according to the correction data to obtain the updated synthesized voice;
the voice issuing module is further configured to return the updated synthesized voice to the user terminal, so that the user terminal replaces the synthesized voice with the updated synthesized voice for playing.
10. A storage medium storing instructions adapted to be loaded by a processor to perform a speech synthesis playing method according to any one of claims 1 to 5 or to perform a speech synthesis playing method according to claim 6 or 7.
CN201910848598.2A 2019-09-09 2019-09-09 Voice synthesis playing method and device and storage medium Pending CN110600004A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910848598.2A CN110600004A (en) 2019-09-09 2019-09-09 Voice synthesis playing method and device and storage medium

Publications (1)

Publication Number Publication Date
CN110600004A true CN110600004A (en) 2019-12-20

Family

ID=68858178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910848598.2A Pending CN110600004A (en) 2019-09-09 2019-09-09 Voice synthesis playing method and device and storage medium

Country Status (1)

Country Link
CN (1) CN110600004A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130179170A1 (en) * 2012-01-09 2013-07-11 Microsoft Corporation Crowd-sourcing pronunciation corrections in text-to-speech engines
CN103366731A (en) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 Text to speech (TTS) method and system
CN104142909A (en) * 2014-05-07 2014-11-12 腾讯科技(深圳)有限公司 Method and device for phonetic annotation of Chinese characters
CN105609096A (en) * 2015-12-30 2016-05-25 小米科技有限责任公司 Text data output method and device
CN106205600A (en) * 2016-07-26 2016-12-07 浪潮电子信息产业股份有限公司 One can Chinese text speech synthesis system and method alternately
CN109461436A (en) * 2018-10-23 2019-03-12 广东小天才科技有限公司 A kind of correcting method and system of speech recognition pronunciation mistake

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021135713A1 (en) * 2019-12-30 2021-07-08 华为技术有限公司 Text-to-voice processing method, terminal and server
CN113129861A (en) * 2019-12-30 2021-07-16 华为技术有限公司 Text-to-speech processing method, terminal and server
CN111145724A (en) * 2019-12-31 2020-05-12 出门问问信息科技有限公司 Polyphone marking method and device and computer readable storage medium
CN111145724B (en) * 2019-12-31 2022-08-19 出门问问信息科技有限公司 Polyphone marking method and device and computer readable storage medium
CN112037756A (en) * 2020-07-31 2020-12-04 北京搜狗科技发展有限公司 Voice processing method, apparatus and medium
CN111930900A (en) * 2020-09-28 2020-11-13 北京世纪好未来教育科技有限公司 Standard pronunciation generating method and related device
CN111930900B (en) * 2020-09-28 2021-09-21 北京世纪好未来教育科技有限公司 Standard pronunciation generating method and related device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination