CN114678001A - Speech synthesis method and speech synthesis device - Google Patents

Speech synthesis method and speech synthesis device

Info

Publication number
CN114678001A
Authority
CN
China
Prior art keywords
prosodic, phoneme sequence, sequence, phoneme, sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210344448.XA
Other languages
Chinese (zh)
Inventor
Gao Yu (高羽)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Midea Group Co Ltd
Midea Group Shanghai Co Ltd
Original Assignee
Midea Group Co Ltd
Midea Group Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Midea Group Co Ltd and Midea Group Shanghai Co Ltd
Priority to CN202210344448.XA
Publication of CN114678001A
Priority to PCT/CN2022/118072 (WO2023184874A1)
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the field of speech synthesis and provides a speech synthesis method comprising the following steps: segmenting a prosodic phoneme sequence of a target text to generate a plurality of sentence sequences, wherein the prosodic phoneme sequence comprises a plurality of phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes, and each sentence sequence comprises at least one phoneme; performing speech synthesis on a first sub-prosodic phoneme sequence among the plurality of sentence sequences to obtain first speech information; and outputting the first speech information while performing speech synthesis on a second sub-prosodic phoneme sequence among the plurality of sentence sequences to generate second speech information, wherein the second sub-prosodic phoneme sequence is at least one sentence sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence. The speech synthesis method effectively speeds up the system's feedback after receiving a network speech synthesis service request and shortens the user's waiting time.

Description

Speech synthesis method and speech synthesis device
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method and a speech synthesis apparatus.
Background
Text-To-Speech (TTS) technology is widely used in the field of speech synthesis. In the related art, speech synthesis is usually performed directly on the entire text to be synthesized. For longer texts, synthesis takes a long time, which means the user must wait a long time to obtain the synthesized speech; the speech synthesis performance is low, the user's time is wasted, and the user experience suffers.
Disclosure of Invention
The present application is directed to solving at least one of the problems in the prior art. Therefore, the application provides a speech synthesis method.
The application also provides a speech synthesis apparatus.
The application also provides an electronic device.
The present application also proposes a non-transitory computer-readable storage medium.
The present application also proposes a computer program product.
The speech synthesis method according to the embodiment of the first aspect of the application comprises the following steps:
segmenting a prosodic phoneme sequence of a target text to generate a plurality of sentence sequences, wherein the prosodic phoneme sequence comprises a plurality of phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes, and each sentence sequence comprises at least one phoneme;
performing speech synthesis on a first sub-prosodic phoneme sequence among the plurality of sentence sequences to obtain first speech information;
and outputting the first speech information while performing speech synthesis on a second sub-prosodic phoneme sequence among the plurality of sentence sequences to generate second speech information, wherein the second sub-prosodic phoneme sequence is at least one sentence sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence.
According to this speech synthesis method, the target text is divided into multiple sentence sequences, the first sentence sequence is synthesized first to generate the first speech information, and the subsequent sentence sequences continue to be synthesized while the first speech information is output; this effectively speeds up the system's feedback after receiving a network speech synthesis service request, shortens the user's waiting time, and improves the user experience.
According to an embodiment of the application, after the generating the second speech information, the method further comprises:
and combining the second speech information and the first speech information to generate third speech information.
According to an embodiment of the present application, before the segmenting the prosodic phoneme sequence of the target text, the method further includes: generating a target file size of the third speech information based on the prosodic phoneme sequence;
the generating the second speech information includes: generating the second speech information based on the target file size.
According to an embodiment of the present application, the generating a target file size of the third speech information based on the prosodic phoneme sequence includes:
generating a predicted file size of the third speech information based on the prosodic phoneme sequence;
and correcting the predicted file size based on a target residual value to generate the target file size, wherein the target residual value is determined based on a sample file size and the predicted size of a sample audio file corresponding to a sample text, and the sample file size is the actual size of the sample audio file corresponding to the sample text.
According to an embodiment of the present application, said combining the second speech information and the first speech information includes:
and combining the second speech information and the first speech information based on the phoneme durations corresponding to the second speech information and the phoneme durations corresponding to the first speech information.
According to an embodiment of the present application, before the segmenting the prosodic phoneme sequence of the target text, the method further includes:
acquiring a text to be synthesized;
and when it is determined that the size of the text to be synthesized exceeds a target threshold, segmenting the text to be synthesized to generate the target text, wherein the size of the target text does not exceed the target threshold.
According to an embodiment of the present application, the segmenting the prosodic phoneme sequence of the target text to generate a plurality of sentence sequences includes:
converting the target text into the prosodic phoneme sequence;
segmenting the prosodic phoneme sequence based on at least a portion of the plurality of prosodic identifiers to generate the plurality of sentence sequences.
According to an embodiment of the present application, the segmenting the prosodic phoneme sequence based on at least part of the plurality of prosodic identifiers to generate the plurality of sentence sequences includes:
determining a first segmentation position in the prosodic phoneme sequence based on the plurality of prosodic identifiers;
determining a second segmentation position from the positions of the prosodic identifiers located after the first segmentation position in the prosodic phoneme sequence;
and segmenting the prosodic phoneme sequence based on the first segmentation position and the second segmentation position to generate a first sub-prosodic phoneme sequence and at least two second sub-prosodic phoneme sequences, wherein the first sub-prosodic phoneme sequence is the part of the prosodic phoneme sequence located before the first segmentation position, the at least two second sub-prosodic phoneme sequences are the parts of the prosodic phoneme sequence located after the first segmentation position, adjacent second sub-prosodic phoneme sequences are delimited by the second segmentation position, and the speech synthesis duration corresponding to the first sub-prosodic phoneme sequence is within a target duration.
According to an embodiment of the present application, the converting the target text into a prosodic phoneme sequence includes:
acquiring the syllables, prosodic words, prosodic phrases, intonation phrases, and sentence end information of the target text;
marking the target text based on at least two of the syllables, the prosodic words, the prosodic phrases, the intonation phrases and the sentence end information to generate the prosodic phoneme sequence.
According to an embodiment of the present application, the tagging the target text based on at least two of the syllable, the prosodic word, the prosodic phrase, the intonation phrase, and the sentence end information to generate the prosodic phoneme sequence includes:
converting the target text into a phoneme sequence;
generating the plurality of prosodic identifiers based on at least two of the syllables, the prosodic words, the prosodic phrases, the intonation phrases, and the sentence end information;
marking the phoneme sequence with the plurality of prosodic identifiers to generate the prosodic phoneme sequence.
The speech synthesis apparatus according to the embodiment of the second aspect of the present application includes:
a first processing module, configured to segment a prosodic phoneme sequence of a target text to generate a plurality of sentence sequences, wherein the prosodic phoneme sequence comprises a plurality of phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes, and each sentence sequence comprises at least one phoneme;
a second processing module, configured to perform speech synthesis on a first sub-prosodic phoneme sequence among the plurality of sentence sequences to obtain first speech information;
a third processing module, configured to output the first speech information and perform speech synthesis on a second sub-prosodic phoneme sequence in the multiple sentence sequences to generate second speech information, where the second sub-prosodic phoneme sequence is at least one sentence sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence.
An electronic device according to an embodiment of the third aspect of the present application includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the speech synthesis method as described in any one of the above when executing the computer program.
A non-transitory computer-readable storage medium according to an embodiment of the fourth aspect of the present application, having stored thereon a computer program which, when executed by a processor, implements the speech synthesis method as described in any of the above.
A computer program product according to an embodiment of the fifth aspect of the present application comprises a computer program which, when executed by a processor, implements the speech synthesis method as described in any of the above.
One or more technical solutions in the embodiments of the present application have at least one of the following technical effects:
the target text is segmented into multiple sentence sequences, the first sentence sequence is synthesized first to generate the first speech information, and the subsequent sentence sequences continue to be synthesized while the first speech information is output; this effectively speeds up the system's feedback after receiving a network speech synthesis service request, shortens the user's waiting time, and helps improve the user experience.
Furthermore, the text to be synthesized is segmented based on the target threshold to generate the target text, so that the actual capability of the server is fully considered and a target text within the server's processing capability is provided for speech synthesis, improving synthesis performance.
Furthermore, a more refined prosody representation can be provided by converting the target text into a phoneme sequence and marking the phoneme sequence based on prosody identifiers corresponding to at least two of sentence end information, intonation phrases, prosody words and syllables to generate the prosody phoneme sequence, so that the segmentation fineness and accuracy in the subsequent segmentation process can be improved.
Furthermore, the size information of the target audio file synthesized by the target text is predicted based on the prosodic phoneme sequence, and the predicted value is corrected based on the target residual value, so that the prediction of the size value of the target file can be realized before the target audio file is generated, and the accuracy and precision of the prediction result are high.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present application, and that those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a speech synthesis method provided in an embodiment of the present application;
fig. 2 is a second schematic flowchart of a speech synthesis method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a speech synthesis apparatus provided in an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in further detail below with reference to the drawings and examples. The following examples are intended to illustrate the present application but are not intended to limit the scope of the present application.
In the description herein, references to "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, such terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and those skilled in the art can combine the embodiments or examples and their features described in this specification as long as they do not contradict each other.
The speech synthesis method of the embodiment of the present application is described below with reference to fig. 1 to 2.
The execution subject of the speech synthesis method may be a speech synthesis apparatus, a server, or a user terminal, including but not limited to a mobile phone, a tablet computer, a PC, a vehicle-mounted terminal, or a smart home appliance.
As shown in fig. 1, the speech synthesis method includes: step 110, step 120 and step 130.
Step 110, segmenting prosodic phoneme sequences of the target text to generate a plurality of sentence sequences;
in this step, the target text is the text currently used for speech synthesis.
The prosodic phoneme sequence is a sequence for characterizing prosodic and phoneme characteristics of the target text.
The prosodic phoneme sequence includes a prosodic identifier located between adjacent phonemes and a plurality of phonemes corresponding to the target text.
A phoneme may be a combination of one or more phonetic units divided according to the natural attributes of speech; a phonetic unit may be the pinyin, initial, or final corresponding to a Chinese character, or an English phonetic symbol or English letter corresponding to an English word.
The prosodic identifier is an identifier for characterizing prosodic features corresponding to each phoneme in the target text, and the prosodic features include, but are not limited to: the phoneme corresponds to the characteristics of tone, syllable, prosodic word, prosodic phrase, intonation phrase, silence, pause and the like.
The fine granularity of the prosodic identifier used for representing the pause is higher than that of the identifier used for representing the prosody of the intonation phrase, the fine granularity used for representing the intonation phrase is higher than that of the prosodic phrase, the fine granularity used for representing the prosodic phrase is higher than that of the prosodic word, and the fine granularity used for representing the prosodic word is higher than that of the syllable.
In actual implementation, different symbols may be used to represent prosodic features of different fine-grained levels.
For example, the target text "Shanghai is overcast turning cloudy today, with a southeast wind of force three to four" can be converted into the prosodic phoneme sequence: sil shang4#0hai3#0shi4#2jin1#0tian1#2yin1#0zhuan3#1duo1#0yun2#3dong1#0nan2#0feng1#2san1#0dao4#1si4#0ji2#4 sil.
It is to be understood that, in this prosodic phoneme sequence, the prosodic identifiers may include the numbers and symbols between adjacent pinyin syllables as well as English phonemes, and the phonemes may include the pinyin corresponding to each Chinese character.
Here, sil represents the silence at the beginning and end of the sentence in the prosodic phoneme sequence; #0 represents a syllable, #1 a prosodic word, #2 a prosodic phrase, #3 an intonation phrase, and #4 the end of a sentence. The number following each phoneme represents its tone; for example, the 4 in shang4 represents the fourth tone of the pinyin "shang".
It is understood that a whole prosodic phoneme sequence is formed by sentence sequences connected in order.
Each prosodic phoneme sequence corresponds to at least one segmentation point, from which at least two sentence sequences can be obtained.
In some embodiments, step 110 may include:
converting the target text into a prosodic phoneme sequence;
segmenting the prosodic phoneme sequence based on at least a portion of the plurality of prosodic identifiers to generate the plurality of sentence sequences.
In this embodiment, the target text is the text currently used for speech synthesis.
The prosodic phoneme sequence is a sequence for characterizing prosodic features of the target text.
The prosodic phoneme sequence includes a prosodic identifier located between adjacent phonemes and a plurality of phonemes corresponding to the target text.
The phoneme can be a combination of one or more character units, and the character unit can be pinyin corresponding to a Chinese character or an English word.
The prosodic identifier is an identifier for characterizing a prosodic feature corresponding to each phoneme in the target text, and the prosodic feature includes, but is not limited to: the phoneme corresponds to the characteristics of tone, syllable, prosodic word, prosodic phrase, intonation phrase, silence, pause and the like.
The fine granularity of the prosodic identifier used for representing the pause is larger than that of the identifier used for representing the prosody of the intonation phrase, the fine granularity used for representing the intonation phrase is larger than that used for representing the prosodic phrase, the fine granularity used for representing the prosodic phrase is larger than that used for representing the prosodic word, and the fine granularity used for representing the prosodic word is larger than that used for representing the syllable.
In actual implementation, different symbols may be used to represent prosodic features of different fine-grained levels.
For example, the target text "Shanghai is overcast turning cloudy today, with a southeast wind of force three to four" can be converted into the prosodic phoneme sequence: sil shang4#0hai3#0shi4#2jin1#0tian1#2yin1#0zhuan3#1duo1#0yun2#3dong1#0nan2#0feng1#2san1#0dao4#1si4#0ji2#4 sil.
It is to be understood that, for the prosodic phoneme sequence, the prosodic identifier may include: the combination of numbers and symbols between adjacent pinyins and specific English characters; the phonemes may include the pinyin corresponding to each chinese character.
Where sil is silence representing beginning and end of a prosodic phoneme sequence, #0 represents syllable, #1 represents prosodic word, #2 represents prosodic phrase, #3 represents intonation phrase, and #4 represents end of a sentence, and the number following each phoneme represents the tone of the phoneme, e.g., 4 in shang4 represents the fourth tone of pinyin "shang".
The whole prosodic phoneme sequence comprises a plurality of phonemes and a plurality of prosodic identifiers, and the plurality of prosodic identifiers comprise prosodic identifiers corresponding to different fine granularity levels.
In an actual execution process, a suitable fine-grained level can be selected as a segmentation standard based on an actual situation, the position of the prosodic identifier corresponding to the fine-grained level in the prosodic phoneme sequence is used as a segmentation point, and the prosodic phoneme sequence is segmented to obtain a plurality of sentence sequences.
It should be noted that each sentence sequence includes a prosody identifier at the segmentation point and at least one phoneme.
It is understood that, for each prosodic phoneme sequence, at least one cut point is associated, and at least two sentence sequences are obtained.
For example, for the prosodic phoneme sequence "sil shang4#0hai3#0shi4#2jin1#0tian1#2yin1#0zhuan3#1duo1#0yun2#3dong1#0nan2#0feng1#2san1#0dao4#1si4#0ji2#4 sil", it is determined based on actual requirements that segmentation is performed at #3; the sequence is cut at each position containing #3, with the prosodic identifier #3 retained with the preceding splicing unit, so that the prosodic phoneme sequence can be segmented into the following sentence sequences:
sentence sequence 1: sil shang4#0hai3#0shi4#2jin1#0tian1#2yin1#0zhuan3#1duo1#0yun2#3;
sentence sequence 2: dong1#0nan2#0feng1#2san1#0dao4#1si4#0ji2#4 sil.
The specific determination process of the segmentation point will be described in the following embodiments, and will not be described herein again.
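As an illustration only (a minimal sketch, not the patent's actual implementation), this segmentation step can be expressed in Python: the sequence is split after every occurrence of a chosen prosodic identifier, so that the identifier stays with the preceding splicing unit, as in the example above.

```python
import re

def split_prosodic_sequence(seq: str, level: str = "#3") -> list[str]:
    # Split after each occurrence of `level`, keeping the prosodic
    # identifier attached to the preceding sentence sequence.
    pieces = re.split(f"(?<={re.escape(level)})", seq)
    return [p.strip() for p in pieces if p.strip()]

seq = ("sil shang4#0hai3#0shi4#2jin1#0tian1#2yin1#0zhuan3#1duo1#0yun2#3"
       "dong1#0nan2#0feng1#2san1#0dao4#1si4#0ji2#4 sil")
for clause in split_prosodic_sequence(seq):
    print(clause)
# sil shang4#0hai3#0shi4#2jin1#0tian1#2yin1#0zhuan3#1duo1#0yun2#3
# dong1#0nan2#0feng1#2san1#0dao4#1si4#0ji2#4 sil
```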
Step 120, performing speech synthesis on a first sub-prosodic phoneme sequence in the plurality of sentence sequences to obtain first speech information;
in this step, the first sub-prosodic phoneme sequence is a prosodic phoneme sequence before a first cut point of the prosodic phoneme sequence.
Continuing with the example of sentence sequence 1 and sentence sequence 2 in the above embodiment, the first sub-prosodic phoneme sequence is sentence sequence 1: sil shang4#0hai3#0shi4#2jin1#0tian1#2yin1#0zhuan3#1duo1#0yun2#3.
In an actual implementation, a vocoder may be used to perform speech synthesis on the first sub-prosodic phoneme sequence to generate first speech information corresponding to the first sub-prosodic phoneme sequence.
Step 130, outputting the first speech information and performing speech synthesis on a second sub-prosodic phoneme sequence in the plurality of sentence sequences to generate second speech information, wherein the second sub-prosodic phoneme sequence is at least one sentence sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence.
In this step, after the first speech information is generated, it is returned to the client for output, so that the user can play the first speech information.
While the first speech information is being output, the background continues to perform speech synthesis on the second sub-prosodic phoneme sequence to generate the second speech information corresponding to it.
The second sub-prosodic phoneme sequence is at least one sentence sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence; for sentence sequence 1 and sentence sequence 2 in the above embodiment, the second sub-prosodic phoneme sequence is sentence sequence 2: dong1#0nan2#0feng1#2san1#0dao4#1si4#0ji2#4 sil.
In other embodiments, the method may further comprise:
when any sentence sequence to be matched among the plurality of sentence sequences matches a cached target sentence sequence, acquiring the target sentence speech corresponding to the target sentence sequence from the cache, and determining the speech corresponding to the sentence sequence to be matched as that target sentence speech;
and when any sentence sequence to be matched among the plurality of sentence sequences does not match any cached target sentence sequence, performing speech synthesis on the sentence sequence to be matched to generate second sentence speech.
For example, the first or second sub-prosodic phoneme sequence is matched against a plurality of pre-cached sentence sequences, and the pre-generated, cached speech corresponding to the matched sentence sequence is acquired, yielding the synthesized speech corresponding to that sub-prosodic phoneme sequence. The corresponding speech is thus obtained from the cache without real-time synthesis, improving speech synthesis efficiency.
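A minimal sketch of this cache-first lookup (the `synthesize` callable and byte-string audio representation are assumptions, not the patent's actual interface):

```python
def speech_for_clause(clause: str, cache: dict[str, bytes], synthesize) -> bytes:
    # If the sentence sequence exactly matches a cached target sentence
    # sequence, reuse the cached target sentence speech; otherwise
    # synthesize it in real time and cache the result for later requests.
    if clause in cache:
        return cache[clause]
    audio = synthesize(clause)
    cache[clause] = audio
    return audio
```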
In some embodiments, the method may further comprise: segmenting the prosodic phoneme sequence of the target text to generate a plurality of candidate sequences; and combining target candidate sequences among the candidate sequences with adjacent candidate sequences to generate a plurality of sentence sequences together with the fine granularity corresponding to each sentence sequence.
For example, a sentence "hope that this song | can let you like | play | XX for you" is a prosodic phoneme sequence "sil xi1#0wang4#1zhe4#0shou3#0ge1#3| neng2#0rang4#1ni2#1xi3#0hu an1#1wei4#0nin2#1bo1#0fang4# 1| EH 1K S #10EH 1K S #0de5#1EH 1K S #4 sil" to get a plurality of candidate sequences: "xi 1#0wang4#1zhe4#0shou3#0ge1# 3", "neng 2#0rang4#1ni2#1xi3#0hu 1# 1", "wei 4#0nin2#1bo1#0fang4# 1" and "EH 1K S #10EH 1K S #0de5#1EH 1K S # 4".
Taking any one of the candidate sequences as the target candidate sequence and combining it with its adjacent candidate sequences yields the following multiple sentence sequences (each sentence sequence lies between two "|" marks):
“sil xi1#0wang4#1zhe4#0shou3#0ge1#3|neng2#0rang4#1ni2#1xi3#0huan1#1|wei4#0nin2#1bo1#0fang4#1|EH1 K S#10EH1 K S#0de5#1EH1 K S#4sil”
“sil xi1#0wang4#1zhe4#0shou3#0ge1#3neng2#0rang4#1ni2#1xi3#0huan1#1|wei4#0nin2#1bo1#0fang4#1|EH1 K S#10EH1 K S#0de5#1EH1 K S#4sil”
“sil xi1#0wang4#1zhe4#0shou3#0ge1#3|neng2#0rang4#1ni2#1xi3#0huan1#1wei4#0nin2#1bo1#0fang4#1|EH1 K S#10EH1 K S#0de5#1EH1 K S#4sil”
“sil xi1#0wang4#1zhe4#0shou3#0ge1#3neng2#0rang4#1ni2#1xi3#0huan1#1wei4#0nin2#1bo1#0fang4#1|EH1 K S#10EH1 K S#0de5#1EH1 K S#4sil”
“sil xi1#0wang4#1zhe4#0shou3#0ge1#3|neng2#0rang4#1ni2#1xi3#0huan1#1|wei4#0nin2#1bo1#0fang4#1EH1 K S#10EH1 K S#0de5#1EH1 K S#4sil”
“sil xi1#0wang4#1zhe4#0shou3#0ge1#3|neng2#0rang4#1ni2#1xi3#0huan1#1wei4#0nin2#1bo1#0fang4#1EH1 K S#10EH1 K S#0de5#1EH1 K S#4sil”
“sil xi1#0wang4#1zhe4#0shou3#0ge1#3neng2#0rang4#1ni2#1xi3#0huan1#1wei4#0nin2#1bo1#0fang4#1EH1 K S#10EH1 K S#0de5#1EH1 K S#4sil”
The multiple sentence sequences are sorted in descending order based on the fine granularity corresponding to each sentence sequence. Specifically, the larger the fine granularity, the earlier the corresponding sentence sequence is ranked; for example, the fully merged sequence "xi1#0wang4#1zhe4#0shou3#0ge1#3neng2#0rang4#1ni2#1xi3#0huan1#1wei4#0nin2#1bo1#0fang4#1EH1 K S#10EH1 K S#0de5#1EH1 K S#4" (hope this song can let you like it, playing XX for you) is ranked ahead of finer-grained sequences such as "xi1#0wang4#1zhe4#0shou3#0ge1#3".
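A sketch of how the candidate sequences might be merged and ordered. Two assumptions are made that the translated text does not fix: each of the n-1 boundaries between adjacent candidates is either kept as a clause break or merged, and "larger fine granularity" is taken to mean longer merged clauses.

```python
from itertools import product

def clause_combinations(cands: list[str]) -> list[list[str]]:
    combos = []
    # Each boundary between adjacent candidates is kept (True) or merged.
    for keep in product([True, False], repeat=len(cands) - 1):
        clauses, cur = [], cands[0]
        for boundary_kept, nxt in zip(keep, cands[1:]):
            if boundary_kept:
                clauses.append(cur)
                cur = nxt
            else:
                cur += nxt
        clauses.append(cur)
        combos.append(clauses)
    # Descending order: combinations containing coarser (longer) clauses first.
    combos.sort(key=lambda c: max(len(x) for x in c), reverse=True)
    return combos
```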
In some embodiments, the multiple sentence sequences may be matched against the cached target sentence sequences one by one, from front to back in the descending order, and the speech of the successfully matched target sentence sequence is determined as the speech of that sentence sequence.
In this embodiment, the sentence sequences are exactly matched against the target sentence sequences in the order of the generated descending sort. For example, the sentence sequence "xi1#0wang4#1zhe4#0shou3#0ge1#3neng2#0rang4#1ni2#1xi3#0huan1#1wei4#0nin2#1bo1#0fang4#1EH1 K S#10EH1 K S#0de5#1EH1 K S#4" (hope this song can let you like it, playing XX for you) is exactly matched against the target sentence sequences; if the match succeeds, the speech corresponding to the matched target sentence sequence is determined as the speech of that sentence sequence, and the comparison ends.
If the match fails, the next sentence sequence "xi1#0wang4#1zhe4#0shou3#0ge1#3neng2#0rang4#1ni2#1xi3#0huan1#1wei4#0nin2#1bo1#0fang4#1" (hope this song can let you like it, playing for you) is compared with the target sentence sequences, and the above process is repeated until some sentence sequence achieves an exact match, at which point the comparison ends.
In the development process, the applicant found that in the related art, speech synthesis is usually performed directly on the entire text to be synthesized. Since speech has a temporal characteristic, the system's conversion time is generally proportional to the length of the input text: the longer the sentence, the longer the synthesis takes.
For inputs of dozens to hundreds of words, the synthesis time of even the fastest deep learning models currently ranges from several seconds to dozens of seconds. For longer texts to be synthesized, speech synthesis takes even longer, meaning the user must wait longer to obtain the synthesized speech; this wastes the user's time and harms the user experience.
In the applicant's research and development, to solve the above problems, the related art also segments the text to be synthesized into multiple sub-texts and performs speech synthesis on them in parallel. However, this approach is limited to GPU operation; on a CPU server it cannot improve synthesis performance and still consumes considerable time.
In the present application, the target text is segmented into multiple sentence sequences, the first sub-prosodic phoneme sequence is synthesized first, the first speech information synthesized from it is output first, and the sentence sequences after the first sub-prosodic phoneme sequence are synthesized while the first speech information is being output. This effectively speeds up the system's feedback after receiving a network speech synthesis service request, shortens the user's waiting time, and improves the user experience.
In an actual implementation, performing speech synthesis on the second sub-prosodic phoneme sequences among the plurality of sentence sequences may mean synthesizing each sentence sequence in turn, following the order in which the sentence sequences were segmented from the target text.
For example, after the target text is divided in order into a first, a second, and a third sentence sequence, the first sentence sequence is the first sub-prosodic phoneme sequence and is synthesized first to generate the first speech information; the second sentence sequence is synthesized while the first speech information is output, and the third sentence sequence is synthesized after the second speech information corresponding to the second sentence sequence has been generated.
Performing speech synthesis on the second sub-prosodic phoneme sequences among the plurality of sentence sequences may also mean synthesizing the sentence sequences simultaneously.
For example, after the target text is divided in order into a first, a second, and a third sentence sequence, the first sentence sequence is the first sub-prosodic phoneme sequence and is synthesized first to generate the first speech information; while the first speech information is output, the second and third sentence sequences are synthesized in parallel using the system's parallel synthesis capability.
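The overall flow of steps 120 and 130 can be sketched as follows (parallel variant; `synthesize` and `output` are hypothetical callables standing in for the synthesis backend and the client-facing output channel):

```python
from concurrent.futures import ThreadPoolExecutor

def stream_tts(clauses: list[str], synthesize, output) -> list[bytes]:
    # Synthesize the first sentence sequence eagerly so the user hears
    # speech as soon as possible, then synthesize the remaining sentence
    # sequences while the first speech information is being output.
    first = synthesize(clauses[0])
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(synthesize, c) for c in clauses[1:]]
        output(first)
        return [first] + [f.result() for f in futures]
```

A sequential variant would simply loop over `clauses[1:]` after `output(first)`, matching the first example above.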
According to the speech synthesis method provided in the embodiments of the application, the target text is segmented into multiple sentence sequences, the first sentence sequence is synthesized first to generate the first speech information, and the subsequent sentence sequences continue to be synthesized while the first speech information is output. This effectively speeds up the system's feedback after receiving a network speech synthesis service request, shortens the user's waiting time, and improves the user experience.
As shown in fig. 2, according to some embodiments of the present application, before step 110, the method may further include:
acquiring a text to be synthesized;
and when the size of the text to be synthesized exceeds a target threshold, segmenting the text to be synthesized to generate the target text, wherein the size of the target text does not exceed the target threshold.
In this embodiment, the text to be synthesized is the original text that needs to be subjected to speech synthesis.
The text to be synthesized may be conventional text of dozens to hundreds of words, or ultra-long text of thousands or tens of thousands of words.
The target threshold may be determined based on at least one of the system's computing power and the upper capability limit of the speech synthesis model; for example, the target threshold may be set in the range of several hundred words.
In actual execution, the size of the acquired text to be synthesized is first determined and compared with the target threshold; when the size does not exceed the target threshold, the whole text to be synthesized is directly determined as the target text.
When the size of the text to be synthesized exceeds the target threshold, the text is segmented into multiple first texts based on at least one of the system's computing power and the capability information of the speech synthesis model, so that the size of each first text does not exceed the target threshold, and the first of these texts is determined as the target text.
According to the speech synthesis method provided in this embodiment, the text to be synthesized is segmented based on the target threshold to generate the target text; the actual capability of the server is thus fully considered, and a target text within the server's processing capability is provided for speech synthesis, improving synthesis performance.
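A naive sketch of this threshold check. Using a character count as the size measure and a fixed-length split are assumptions made for illustration; a real system would derive the threshold from server capability and cut at sentence boundaries.

```python
def to_target_texts(text: str, target_threshold: int) -> list[str]:
    # Below the threshold, the whole text to be synthesized is the target text.
    if len(text) <= target_threshold:
        return [text]
    # Otherwise split into first texts no larger than the threshold;
    # the first segment becomes the current target text.
    return [text[i:i + target_threshold]
            for i in range(0, len(text), target_threshold)]
```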
In some embodiments, step 110 may include:
acquiring the sentence end information, intonation phrases, prosodic words, and syllables of the target text;
and marking the target text based on at least two of the sentence end information, the intonation phrases, the prosodic words, and the syllables to generate the prosodic phoneme sequence.
In this embodiment, a syllable is a phonetic unit in the speech stream and also the phonetic unit most easily distinguished by ear; for example, each Chinese character in the target text corresponds to one syllable.
Prosodic words are a set of syllables that are closely related and pronounced together in the actual stream of speech.
A prosodic phrase is an intermediate rhythm chunk between a prosodic word and an intonation phrase; it may comprise one or more prosodic words, and the prosodic words composing a prosodic phrase sound as though they share one rhythm group.
An intonation phrase is a sentence formed by connecting several prosodic phrases according to a certain intonation pattern, and is used to represent a larger pause.
The sentence end information is used to characterize the end of each long sentence.
For example, for the target text "Shanghai is overcast turning cloudy today, southeast wind, force three to four": each Chinese character, such as "Shang", "hai", and "shi", is a syllable of the target text; words such as "Shanghai", "today", and "overcast turning cloudy", or phrases composed of such words, are prosodic phrases of the target text; and the sentence "Shanghai is overcast turning cloudy today", composed of the prosodic phrases "Shanghai", "today", and "overcast turning cloudy", is an intonation phrase of the target text.
After obtaining the sentence end information, intonation phrases, prosodic words, syllables, and similar information of the target text, the target text is marked based on at least two of them to generate the prosodic phoneme sequence.
In the research and development process, the applicant found that the related art often represents the prosody of a sentence using its punctuation, for example segmenting a sentence at commas or periods to obtain several clauses. On the one hand, this cannot handle text without punctuation; on the other hand, the two segments after the cut are unbalanced, and the segmentation effect is poor.
In the present application, at least two of sentence end information, intonation phrases, prosodic words, and syllables are used to represent the prosody of the sentence, and the target text is segmented based on this prosodic information. This avoids cutting the target text in the middle of a whole word, so the sentence sequences obtained after segmentation have natural pauses and prosody.
In some embodiments, generating a prosodic phoneme sequence based on at least two of the sentence end information, the intonation phrase, the prosodic word, and the syllable, includes:
converting the target text into a phoneme sequence;
generating a plurality of prosodic identifiers based on at least two of sentence end information, intonation phrases, prosodic words, and syllables;
and marking the phoneme sequence based on the plurality of prosodic identifiers to generate a prosodic phoneme sequence.
In this embodiment, the phoneme sequence is a sequence formed by connecting the pronunciation marks corresponding to each Chinese character or English word in the target text, including pinyin, tones, or English phonetic notation.
For example, the target text "Shanghai is overcast turning cloudy today, with a southeast wind of force three to four" can be converted into the phoneme sequence: shang4 hai3 shi4 jin1 tian1 yin1 zhuan3 duo1 yun2 dong1 nan2 feng1 san1 dao4 si4 ji2.
The prosody identifier is an identifier for characterizing prosodic features corresponding to each phoneme in the target text, that is, the prosodic identifier is a symbol for characterizing sentence end information, intonation phrases, prosodic words, and syllables.
In a practical implementation, the prosodic identifier may be represented in the form of a combination of a special symbol and a number or a specific letter combination, for example, by "# 0", "# 1", "# 2", "# 3", and "# 4", respectively, with different combinations characterizing different fine granularity levels.
Such as: #0 represents a syllable, #1 a prosodic word, #2 a prosodic phrase, #3 an intonation phrase, and #4 the end of a sentence. In this embodiment, the fine granularity, from smallest to largest, is: #0 < #1 < #2 < #3 < #4.
After the phoneme sequence and the prosodic identifiers corresponding to the target text are obtained, the prosodic identifiers are inserted at the corresponding positions in the phoneme sequence; for example, the identifier #0 characterizing a syllable is inserted after the pinyin of each syllable, and the identifier #2 characterizing a prosodic phrase is inserted after each prosodic phrase, thereby converting the phoneme sequence into the prosodic phoneme sequence.
For example, the phoneme sequence "shang4 hai3 shi4 jin1 tian1 yin1 zhuan3 duo1 yun2 dong1 nan2 feng1 san1 dao4 si4 ji2" is marked with "#0", "#1", "#2", "#3", and "#4" to generate the prosodic phoneme sequence: sil shang4#0hai3#0shi4#2jin1#0tian1#2yin1#0zhuan3#1duo1#0yun2#3dong1#0nan2#0feng1#2san1#0dao4#1si4#0ji2#4 sil.
Wherein sil represents silence of beginning and end of sentence.
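A sketch of this marking step, assuming per-phoneme boundary labels are already available from a prosody front end (the label names and the helper itself are hypothetical; only the #0-#4 identifiers come from the text above):

```python
MARKS = {"syllable": "#0", "prosodic_word": "#1", "prosodic_phrase": "#2",
         "intonation_phrase": "#3", "sentence_end": "#4"}

def mark_phonemes(phonemes: list[str], boundaries: list[str]) -> str:
    # Insert after each phoneme the identifier of the prosodic boundary
    # that follows it, then wrap the sequence in sentence silences.
    body = "".join(p + MARKS[b] for p, b in zip(phonemes, boundaries))
    return f"sil {body} sil"

print(mark_phonemes(["shang4", "hai3", "shi4"],
                    ["syllable", "syllable", "prosodic_phrase"]))
# sil shang4#0hai3#0shi4#2 sil
```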
According to the speech synthesis method provided by the embodiment of the application, a more refined prosody representation can be provided by converting the target text into the phoneme sequence and marking the phoneme sequence based on prosody identifiers corresponding to at least two of sentence end information, intonation phrases, prosodic words and syllables to generate the prosodic phoneme sequence, so that the segmentation fineness and accuracy in the subsequent segmentation process can be improved.
With continuing reference to fig. 2, according to some embodiments of the present application, the method may further comprise: generating a target file size of the third speech information based on the prosodic phoneme sequence;
step 130 may include: generating the second speech information based on the target file size.
In this embodiment, the third speech information is the speech information generated by combining the speech synthesized from at least two of the plurality of sentence sequences corresponding to the target text, one of which is the first sub-prosodic phoneme sequence.
The target file size is the predicted file size of the third speech information.
The target file size may be file volume information or the speech length of the third speech information; this application does not limit which.
After the target file size of the third speech information is obtained, the speech data generated based on the second sub-prosodic phoneme sequence is padded based on the target file size, thereby generating the second speech information.
In some embodiments, generating the target file size of the third speech information based on the prosodic phoneme sequence may include:
generating a predicted file size of the third speech information based on the prosodic phoneme sequence;
and correcting the size of the predicted file based on the target residual value to generate the size of the target file.
In this embodiment, the predicted file size is the initial, uncorrected file size of the speech to be synthesized from the target text, predicted based on the prosodic phoneme sequence.
The target residual value is used for correcting the size of the predicted file so as to improve the accuracy of the size of the finally generated target file.
The target residual value is determined based on the sample file size and the predicted size of the sample audio file corresponding to the sample text, where the sample file size is the actual size of the sample audio file corresponding to the sample text. The target file size is the corrected file size of the speech synthesized from the target text, predicted based on the prosodic phoneme sequence. It will be appreciated that the target file size is more accurate than the predicted file size.
The target residual value is a predetermined value, for example, the target residual value may be the maximum residual value.
In this embodiment, the prediction file size is corrected by performing residual addition processing on the prediction file size, so that the accuracy of the finally generated target file size is improved.
In some embodiments, the target residual value may be determined by:
acquiring a sample text, the sample audio file corresponding to the sample text, and the sample file size corresponding to the sample audio file, wherein the sample audio file is generated by performing speech synthesis on the sample text;
converting the sample text into a sample prosodic phoneme sequence;
predicting the size of the sample audio file based on the sample prosodic phoneme sequence to generate the size of a sample predicted file of the sample audio file;
and determining the absolute value of the difference value between the sample file size and the sample prediction file size as a target residual value.
In this embodiment, the sample text may be regular text of several tens to several hundreds of levels, or may be ultra-long text of several thousands or tens of thousands levels.
The sample audio file is an audio file finally generated by performing speech synthesis on the sample text.
The sample file size is the actual size value or actual audio duration of the sample audio file.
For example, a speech synthesis system may be employed to calculate the actual wav file size or audio duration of a sample audio file corresponding to the sample text.
The sample prediction file size is the size value or audio duration of the sample audio file that is predicted and uncorrected.
The sample prediction file size should be generated in a manner consistent with the generation of the prediction file size.
The absolute value of the difference between the sample predicted file size and the sample file size is calculated as the target residual value.
It will be appreciated that, in implementation, the sample prosodic phoneme sequence may be predicted multiple times to obtain multiple sample predicted file sizes. The difference between each sample predicted file size and the sample file size is calculated to obtain multiple candidate differences; the candidate with the smallest absolute value is then selected as the target residual value, improving its accuracy.
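A sketch of the residual correction under the reading above. Two readings of the translated text are assumed: the target residual is the smallest-magnitude difference over repeated sample predictions, and the correction is residual addition.

```python
def target_residual(actual_size: float, predicted_sizes: list[float]) -> float:
    # Candidate differences between each sample predicted file size and
    # the actual sample file size; keep the one of smallest magnitude.
    return min(abs(p - actual_size) for p in predicted_sizes)

def corrected_size(predicted: float, residual: float) -> float:
    # Residual addition: correct the predicted file size of the third
    # speech information before any audio is generated.
    return predicted + residual
```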
According to the speech synthesis method provided by the embodiment of the application, the size information of the target audio file synthesized by the target text is predicted based on the prosodic phoneme sequence, and the predicted value is corrected based on the target residual value, so that the prediction of the size value of the target file can be realized before the target audio file is generated, and the accuracy and precision of the prediction result are high.
With continued reference to fig. 2, in some embodiments, step 110 may include:
determining a first segmentation position in the prosodic phoneme sequence based on the plurality of prosodic identifiers;
determining a second segmentation position from the positions of the prosodic identifiers located after the first segmentation position in the prosodic phoneme sequence;
and segmenting the prosodic phoneme sequence based on the first segmentation position and the second segmentation position to generate a first sub-prosodic phoneme sequence and at least two second sub-prosodic phoneme sequences, wherein the first sub-prosodic phoneme sequence is the part of the prosodic phoneme sequence located before the first segmentation position, the at least two second sub-prosodic phoneme sequences are the parts of the prosodic phoneme sequence located after the first segmentation position, adjacent second sub-prosodic phoneme sequences are delimited by the second segmentation position, and the speech synthesis duration corresponding to the first sub-prosodic phoneme sequence is within the target duration.
In this embodiment, the first segmentation position is the segmentation point of the first cut.
The second segmentation position is the position of the segmentation point for each cut after the first.
Based on the first segmentation position, the prosodic phoneme sequence can be divided into a front and a rear subsequence, and the subsequence located before the first segmentation position can be determined as the first sub-prosodic phoneme sequence.
It should be noted that the speech synthesis duration corresponding to the first sub-prosodic phoneme sequence generated based on the first segmentation position is within the target duration.
The speech synthesis duration corresponding to the first sub-prosodic phoneme sequence is time consumed for synthesizing the first sub-prosodic phoneme sequence into speech.
The speech synthesis duration is related to the computational power of the speech synthesis system.
The target duration is a short duration; its value may be customized by the user or take a system default, for example 0.2 s or 0.3 s.
After the first segmentation position is determined, at least some of the prosodic identifiers located after it are taken from the prosodic phoneme sequence as a candidate set for the second segmentation position, and the positions of the prosodic identifiers in the candidate set are determined as second segmentation positions.
It is to be understood that, in other embodiments, when no second segmentation position exists, the second sub-prosodic phoneme sequence is the entire part of the prosodic phoneme sequence after the first segmentation position. For example, when the second segmentation position is the position corresponding to #3 but no #3 is found after the first segmentation position, no second segmentation position exists.
In this embodiment, the first segmentation position is determined based on the prosodic identifiers in the prosodic phoneme sequence, so that the speech synthesis duration of the first sub-prosodic phoneme sequence obtained from it falls within a reasonable range, shortening the synthesis system's first-sentence response time and delay. In addition, a first segmentation position determined in this way falls at a position with a longer pause, so the first sub-prosodic phoneme sequence obtained by segmentation has more natural pauses and prosody, and the speech subsequently synthesized from it is more natural and fluent.
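A sketch of choosing the first segmentation position. The per-unit cost estimator `synth_time` is hypothetical and would depend on the system's computing power, as noted above.

```python
def first_cut(units: list[str], synth_time, target_duration: float = 0.3) -> int:
    # Take the longest prefix of prosodic units whose estimated synthesis
    # time stays within the target duration; the boundary after that
    # prefix is the first segmentation position.
    cut, elapsed = 0, 0.0
    for i, unit in enumerate(units, start=1):
        elapsed += synth_time(unit)
        if elapsed > target_duration:
            break
        cut = i
    return cut
```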
With continued reference to fig. 2, after step 130, the method may further include, according to some embodiments of the present application:
and combining the first voice information and the second voice information to generate third voice information.
In this embodiment, the second speech information is speech information obtained by performing speech synthesis on a second sub-prosodic phoneme sequence, where the second sub-prosodic phoneme sequence may be one or more sentence sequences, and the second sub-prosodic phoneme sequences are located after the first sub-prosodic phoneme sequence in the target text.
For example, in the case of sequentially synthesizing the second speech information, while outputting the first speech information, speech synthesis may be performed on a second sentence sequence adjacent to the first sub-prosodic phoneme sequence, which is located after the first sub-prosodic phoneme sequence, to generate second speech information corresponding to the second sentence sequence, and while outputting the second speech information, the first speech information and the second speech information may be combined to generate third speech information.
For another example, in the case of synthesizing the second speech information sequentially, while the first speech information is being output, speech synthesis may be performed on the second sentence sequence adjacent to and located after the first sub-prosodic phoneme sequence to generate its second speech information; while that second speech information is being output, speech synthesis may be performed on the third sentence sequence adjacent to and located after the second sentence sequence to generate the second speech information corresponding to the third sentence sequence; and so on for the subsequent sentence sequences, until second speech information corresponding to all sentence sequences has been generated, after which the first speech information and all the second speech information are combined to generate the third speech information.
In the case of synthesizing the second speech information in parallel, while the first speech information is being output, the plurality of sentence sequences located after the first sub-prosodic phoneme sequence may be synthesized in parallel to generate the second speech information corresponding to each sentence sequence, after which the first speech information and the resulting pieces of second speech information are combined to generate the third speech information.
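The two scheduling strategies can be sketched as follows; synthesize() and play() are placeholders standing in for the acoustic model/vocoder and the audio output path, and their timing constants are invented purely for the simulation.

```python
# Sketch of sequential vs. parallel synthesis of the clauses that follow
# the first one; all functions and constants here are illustrative stubs.
import time
from concurrent.futures import ThreadPoolExecutor


def synthesize(clause):
    # Placeholder synthesis: cost proportional to clause length.
    time.sleep(0.005 * len(clause))
    return bytes(len(clause))


def play(audio):
    # Placeholder playback: roughly real-time output of the audio.
    time.sleep(0.01 * len(audio))


def speak_sequential(clauses):
    # Overlap playback of clause i with synthesis of clause i + 1, so the
    # first clause is heard without waiting for the whole text.
    with ThreadPoolExecutor(max_workers=1) as player:
        audio = [synthesize(clauses[0])]
        pending = player.submit(play, audio[0])
        for clause in clauses[1:]:
            audio.append(synthesize(clause))  # runs while previous clause plays
            pending.result()
            pending = player.submit(play, audio[-1])
        pending.result()
    return audio


def speak_parallel(clauses):
    # Output the first clause while all remaining clauses are synthesized
    # concurrently; results keep clause order for later splicing.
    first = synthesize(clauses[0])
    with ThreadPoolExecutor() as pool:
        playback = pool.submit(play, first)
        rest = list(pool.map(synthesize, clauses[1:]))
        playback.result()
    return [first] + rest
```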
In some embodiments, combining the first speech information and the second speech information may include: combining the first speech information and the second speech information based on the phoneme duration corresponding to the first speech information and the phoneme duration corresponding to the second speech information.
In this embodiment, the duration of a phoneme is the duration of the pronunciation corresponding to the phoneme.
For example, for sentence sequence 1: sil shang4#0hai3#0shi4#2jin1#0tian1#2yin1#0zhuan3#1duo1#0yun2#3 ("Shanghai City: overcast turning cloudy today"), "shang" can be split into two phonemes, "sh" and "ang", each of which corresponds to a pronunciation duration.
During splicing, the speech information corresponding to each sentence sequence needs to be trimmed based on the redundant phoneme durations at the head or tail of that sentence sequence, so as to remove the redundant phoneme durations at the head or tail of the speech information.
Taking the first speech information as an example, when the first speech information is speech, after the speech is synthesized, the durations corresponding to the redundant phonemes at the head or tail of the first sub-prosodic phoneme sequence are cut from the speech to generate the trimmed first speech information.
When the first speech information is a high-level acoustic feature, after the first speech information is synthesized, the durations corresponding to the redundant phonemes at the head or tail of the first sub-prosodic phoneme sequence are cut from the high-level acoustic feature corresponding to the first speech information to generate a trimmed high-level acoustic feature; the trimmed high-level acoustic feature is then synthesized into speech by a vocoder to generate the trimmed first speech information.
The second speech information is trimmed in the same way as the first speech information, which is not repeated here.
Then, based on the order in which the first sub-prosodic phoneme sequence corresponding to the first speech information and the second sub-prosodic phoneme sequences corresponding to the second speech information were cut from the prosodic phoneme sequence, the trimmed speech information corresponding to adjacent sentence sequences is spliced in turn, starting from the first sentence sequence, until the speech information corresponding to all sentence sequences has been spliced.
According to the speech synthesis method provided in the embodiments of the present application, splicing the first speech information and the second speech information based on phoneme durations improves the naturalness and fluency at the junctions between adjacent pieces of speech information, without requiring a preset library of speech splicing units.
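A minimal sketch of this trimming-and-splicing step is given below, assuming 16 kHz mono 16-bit PCM audio and per-clause (phoneme, duration) pairs produced by a duration model; how many head or tail phonemes count as redundant (for example, padding silence) is left as an input, since the text does not fix it.

```python
# Minimal sketch of duration-based trimming and splicing under the stated
# PCM assumptions; the redundant-phoneme counts are supplied by the caller.

SAMPLE_RATE = 16_000
BYTES_PER_SAMPLE = 2


def trim_clause(pcm, durations, drop_head=0, drop_tail=0):
    # Cut the audio spans corresponding to redundant phonemes at the head
    # and/or tail of one clause before splicing.
    head_sec = sum(d for _, d in durations[:drop_head])
    tail_sec = sum(d for _, d in durations[len(durations) - drop_tail:]) if drop_tail else 0.0
    head = int(head_sec * SAMPLE_RATE) * BYTES_PER_SAMPLE
    tail = int(tail_sec * SAMPLE_RATE) * BYTES_PER_SAMPLE
    return pcm[head:len(pcm) - tail] if tail else pcm[head:]


def splice(clause_audio):
    # Concatenate trimmed clause audio in the order in which the clauses
    # were cut from the prosodic phoneme sequence.
    return b"".join(clause_audio)
```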
The following describes a speech synthesis apparatus provided in an embodiment of the present application, and the speech synthesis apparatus described below and the speech synthesis method described above may be referred to correspondingly.
As shown in fig. 3, the speech synthesis apparatus includes: a first processing module 310, a second processing module 320, and a third processing module 330.
A first processing module 310, configured to segment a prosodic phoneme sequence of a target text to generate a plurality of sentence sequences, where the prosodic phoneme sequence includes a plurality of phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes, and each sentence sequence includes at least one phoneme;
the second processing module 320 is configured to perform speech synthesis on a first sub-prosodic phoneme sequence in the multiple sentence sequences to obtain first speech information;
the third processing module 330 is configured to output the first speech information and perform speech synthesis on a second sub-prosodic phoneme sequence in the multiple sentence sequences to generate second speech information, where the second sub-prosodic phoneme sequence is at least one sentence sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence.
According to the speech synthesis apparatus provided in the embodiments of the present application, the target text is segmented into a plurality of sentence sequences, the first sentence sequence is synthesized into speech first to generate the first speech information, and the subsequent sentence sequences continue to be synthesized while the first speech information is being output. This effectively increases the feedback speed of the system after a network speech synthesis service request is received, shortens the user's waiting time, and improves the user experience.
In some embodiments, the first processing module 310 is further configured to:
converting the target text into a prosodic phoneme sequence, wherein the prosodic phoneme sequence comprises prosodic identifiers located between adjacent phonemes and a plurality of phonemes corresponding to the target text;
the prosodic phoneme sequence is segmented based on at least a portion of the plurality of prosodic identifiers to generate a plurality of sentence sequences, each sentence sequence including at least one phoneme.
In some embodiments, the apparatus may further comprise:
a fifth processing module, configured to combine the first speech information and the second speech information after the second speech information is generated, so as to generate third speech information.
In some embodiments, the apparatus may further comprise:
a sixth processing module, configured to generate a target file size of the third speech information based on the prosodic phoneme sequence after converting the target text into the prosodic phoneme sequence;
the third processing module 330 is further configured to generate the second speech information based on the target file size.
In some embodiments, the sixth processing module is further configured to:
generating a predicted file size of the third speech information based on the prosodic phoneme sequence;
correcting the predicted file size based on a target residual value to generate the target file size, where the target residual value is determined based on a sample file size and a predicted size of a sample audio file corresponding to a sample text, the sample file size being the actual size of the sample audio file corresponding to the sample text.
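As an illustration, a residual-corrected size prediction might look like the following sketch; the linear duration-to-bytes model and the mean residual are assumptions, since the text only specifies that a predicted size is corrected by a residual learned from sample texts and their actual audio files.

```python
# Sketch of the residual-corrected size prediction; model and statistics
# here are illustrative assumptions rather than the claimed computation.

BYTES_PER_SECOND = 32_000  # e.g. 16 kHz, 16-bit mono PCM


def predict_file_size(phoneme_durations):
    # Predicted size follows from the total predicted phoneme duration.
    return int(sum(phoneme_durations) * BYTES_PER_SECOND)


def target_file_size(phoneme_durations, sample_actual, sample_predicted):
    # Target residual: mean gap between the actual sizes of the sample
    # audio files and the sizes predicted for the same sample texts.
    residual = sum(a - p for a, p in zip(sample_actual, sample_predicted)) / len(sample_actual)
    return predict_file_size(phoneme_durations) + int(residual)
```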
In some embodiments, the fifth processing module is further configured to combine the first speech information and the second speech information based on a phoneme duration corresponding to the first speech information and a phoneme duration corresponding to the second speech information.
In some embodiments, the apparatus may further comprise:
a seventh processing module, configured to acquire a text to be synthesized before the target text is converted into the prosodic phoneme sequence;
and, in a case that the size of the text to be synthesized exceeds a target threshold, segment the text to be synthesized to generate the target text, wherein the size of the target text does not exceed the target threshold.
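A possible form of this pre-segmentation step is sketched below; splitting at sentence punctuation is an assumed strategy and the threshold value is illustrative, as the text does not specify either.

```python
# Sketch of pre-segmenting an over-long text into target texts, each
# within the target threshold; the split rule and threshold are assumed.
import re

TARGET_THRESHOLD = 200  # characters; illustrative default


def to_target_texts(text):
    # Split an over-long text into target texts, each within the threshold.
    if len(text) <= TARGET_THRESHOLD:
        return [text]
    pieces, current = [], ""
    for sent in re.split(r"(?<=[。！？!?.])", text):
        if current and len(current) + len(sent) > TARGET_THRESHOLD:
            pieces.append(current)
            current = ""
        current += sent
    if current:
        pieces.append(current)
    return pieces
```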
In some embodiments, the first processing module 310 is further configured to:
determining a first segmentation position in the prosodic phoneme sequence based on the plurality of prosodic identifiers;
determining a second segmentation position from among the positions of the prosodic identifiers located after the first segmentation position in the prosodic phoneme sequence;
and segmenting the prosodic phoneme sequence based on the first segmentation position and the second segmentation position to generate a first sub-prosodic phoneme sequence and at least two second sub-prosodic phoneme sequences, wherein the first sub-prosodic phoneme sequence is the part of the prosodic phoneme sequence located before the first segmentation position, the at least two second sub-prosodic phoneme sequences are parts of the prosodic phoneme sequence located after the first segmentation position, adjacent second sub-prosodic phoneme sequences are delimited by the second segmentation position, and the speech synthesis duration corresponding to the first sub-prosodic phoneme sequence is within the target duration.
In some embodiments, the first processing module 310 is further configured to:
obtaining prosodic words, syllables, prosodic phrases, sentence end information and intonation phrases of the target text;
and marking the target text based on at least two of prosodic words, syllables, prosodic phrases, sentence end information and intonation phrases to generate a prosodic phoneme sequence.
In some embodiments, the first processing module 310 is further configured to:
converting the target text into a phoneme sequence;
generating a plurality of prosodic identifiers based on at least two of prosodic words, syllables, prosodic phrases, sentence end information, and intonation phrases;
the phoneme sequence is labeled based on the plurality of prosodic identifiers to generate a prosodic phoneme sequence.
Fig. 4 illustrates a physical structure diagram of an electronic device. As shown in fig. 4, the electronic device may include: a processor (processor) 410, a communication interface 420, a memory (memory) 430 and a communication bus 440, wherein the processor 410, the communication interface 420 and the memory 430 communicate with one another via the communication bus 440. The processor 410 may invoke logic instructions in the memory 430 to perform a speech synthesis method comprising: segmenting a prosodic phoneme sequence of a target text to generate a plurality of sentence sequences, wherein the prosodic phoneme sequence comprises a plurality of phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes, and each sentence sequence comprises at least one phoneme; performing speech synthesis on a first sub-prosodic phoneme sequence in the plurality of sentence sequences to obtain first speech information; and outputting the first speech information and performing speech synthesis on a second sub-prosodic phoneme sequence in the plurality of sentence sequences to generate second speech information, wherein the second sub-prosodic phoneme sequence is at least one sentence sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence.
In addition, the logic instructions in the memory 430 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application, or the part thereof contributing to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
Further, the present application also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium and that, when executed by a processor, performs the speech synthesis method provided by the foregoing method embodiments, the method comprising: segmenting a prosodic phoneme sequence of a target text to generate a plurality of sentence sequences, wherein the prosodic phoneme sequence comprises a plurality of phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes, and each sentence sequence comprises at least one phoneme; performing speech synthesis on a first sub-prosodic phoneme sequence in the plurality of sentence sequences to obtain first speech information; and outputting the first speech information and performing speech synthesis on a second sub-prosodic phoneme sequence in the plurality of sentence sequences to generate second speech information, wherein the second sub-prosodic phoneme sequence is at least one sentence sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence.
In another aspect, an embodiment of the present application further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the speech synthesis method provided in the foregoing embodiments, the method comprising: segmenting a prosodic phoneme sequence of a target text to generate a plurality of sentence sequences, wherein the prosodic phoneme sequence comprises a plurality of phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes, and each sentence sequence comprises at least one phoneme; performing speech synthesis on a first sub-prosodic phoneme sequence in the plurality of sentence sequences to obtain first speech information; and outputting the first speech information and performing speech synthesis on a second sub-prosodic phoneme sequence in the plurality of sentence sequences to generate second speech information, wherein the second sub-prosodic phoneme sequence is at least one sentence sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
The above embodiments are merely illustrative of the present application and are not intended to limit the present application. Although the present application has been described in detail with reference to the embodiments, those skilled in the art should understand that various combinations, modifications and equivalents may be made to the technical solutions of the present application without departing from the spirit and scope of the technical solutions of the present application, and the technical solutions of the present application should be covered by the claims of the present application.

Claims (14)

1. A speech synthesis method, comprising:
segmenting a prosodic phoneme sequence of a target text to generate a plurality of sentence sequences, wherein the prosodic phoneme sequence comprises a plurality of phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes, and each sentence sequence comprises at least one phoneme;
performing speech synthesis on a first sub-prosodic phoneme sequence in the plurality of sentence sequences to obtain first speech information;
and outputting the first speech information and performing speech synthesis on a second sub-prosodic phoneme sequence in the plurality of sentence sequences to generate second speech information, wherein the second sub-prosodic phoneme sequence is at least one sentence sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence.
2. The speech synthesis method of claim 1, wherein after the generating of the second speech information, the method further comprises:
combining the second speech information and the first speech information to generate third speech information.
3. The speech synthesis method according to claim 2, wherein before the segmenting of the prosodic phoneme sequence of the target text, the method further comprises:
generating a target file size of the third speech information based on the prosodic phoneme sequence;
the generating of the second speech information comprises: generating the second speech information based on the target file size.
4. The speech synthesis method according to claim 3, wherein the generating a target file size of the third speech information based on the prosodic phoneme sequence includes:
generating a predicted file size of the third speech information based on the prosodic phoneme sequence;
and correcting the predicted file size based on a target residual value to generate the target file size, wherein the target residual value is determined based on a sample file size and a predicted size of a sample audio file corresponding to a sample text, and the sample file size is the actual size of the sample audio file corresponding to the sample text.
5. The speech synthesis method of claim 2, wherein the combining the second speech information and the first speech information comprises:
and combining the second speech information and the first speech information based on the phoneme duration corresponding to the second speech information and the phoneme duration corresponding to the first speech information.
6. The speech synthesis method according to any one of claims 1-5, wherein before the segmenting of the prosodic phoneme sequence of the target text, the method further comprises:
acquiring a text to be synthesized;
and, in a case that the size of the text to be synthesized exceeds a target threshold, segmenting the text to be synthesized to generate the target text, wherein the size of the target text does not exceed the target threshold.
7. The speech synthesis method of any one of claims 1-5, wherein the segmenting the prosodic phoneme sequence of the target text to generate a plurality of sentence sequences comprises:
converting the target text into the prosodic phoneme sequence;
segmenting the prosodic phoneme sequence based on at least a portion of the plurality of prosodic identifiers to generate the plurality of sentence sequences.
8. The speech synthesis method of claim 7, wherein the segmenting the prosodic phoneme sequence based on at least a portion of the plurality of prosodic identifiers to generate the plurality of sentence sequences comprises:
determining a first segmentation position in the prosodic phoneme sequence based on the plurality of prosodic identifiers;
determining a second segmentation position from among the prosodic identifiers located after the first segmentation position in the prosodic phoneme sequence;
and segmenting the prosodic phoneme sequence based on the first segmentation position and the second segmentation position to generate the first sub-prosodic phoneme sequence and at least two second sub-prosodic phoneme sequences, wherein the first sub-prosodic phoneme sequence is the part of the prosodic phoneme sequence located before the first segmentation position, the at least two second sub-prosodic phoneme sequences are parts of the prosodic phoneme sequence located after the first segmentation position, adjacent second sub-prosodic phoneme sequences are delimited by the second segmentation position, and the speech synthesis duration corresponding to the first sub-prosodic phoneme sequence is within a target duration.
9. The speech synthesis method of claim 7, wherein the converting the target text into the prosodic phoneme sequence comprises:
acquiring syllables, prosodic words, prosodic phrases, intonation phrases and sentence end information of the target text;
marking the target text based on at least two of the syllables, the prosodic words, the prosodic phrases, the intonation phrases and the sentence end information to generate the prosodic phoneme sequence.
10. The speech synthesis method according to claim 9, wherein the marking the target text based on at least two of the syllables, the prosodic words, the prosodic phrases, the intonation phrases and the sentence end information to generate the prosodic phoneme sequence comprises:
converting the target text into a phoneme sequence;
generating the plurality of prosodic identifiers based on at least two of the syllables, the prosodic words, the prosodic phrases, the intonation phrases, and the sentence end information;
and marking the phoneme sequence based on the plurality of prosodic identifiers to generate the prosodic phoneme sequence.
11. A speech synthesis apparatus, comprising:
a first processing module, configured to segment a prosodic phoneme sequence of a target text to generate a plurality of sentence sequences, wherein the prosodic phoneme sequence comprises a plurality of phonemes corresponding to the target text and prosodic identifiers located between adjacent phonemes, and each sentence sequence comprises at least one phoneme;
a second processing module, configured to perform speech synthesis on a first sub-prosodic phoneme sequence in the plurality of sentence sequences to obtain first speech information;
and a third processing module, configured to output the first speech information and perform speech synthesis on a second sub-prosodic phoneme sequence in the plurality of sentence sequences to generate second speech information, wherein the second sub-prosodic phoneme sequence is at least one sentence sequence located after the first sub-prosodic phoneme sequence in the prosodic phoneme sequence.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the speech synthesis method according to any one of claims 1 to 10 when executing the program.
13. A non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when being executed by a processor, implementing the speech synthesis method according to any one of claims 1 to 10.
14. A computer program product comprising a computer program, characterized in that the computer program realizes the speech synthesis method according to any one of claims 1 to 10 when executed by a processor.
CN202210344448.XA 2022-03-31 2022-03-31 Speech synthesis method and speech synthesis device Pending CN114678001A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210344448.XA CN114678001A (en) 2022-03-31 2022-03-31 Speech synthesis method and speech synthesis device
PCT/CN2022/118072 WO2023184874A1 (en) 2022-03-31 2022-09-09 Speech synthesis method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210344448.XA CN114678001A (en) 2022-03-31 2022-03-31 Speech synthesis method and speech synthesis device

Publications (1)

Publication Number Publication Date
CN114678001A true CN114678001A (en) 2022-06-28

Family

ID=82075989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210344448.XA Pending CN114678001A (en) 2022-03-31 2022-03-31 Speech synthesis method and speech synthesis device

Country Status (1)

Country Link
CN (1) CN114678001A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023184874A1 (en) * 2022-03-31 2023-10-05 美的集团(上海)有限公司 Speech synthesis method and apparatus
CN116884399A (en) * 2023-09-06 2023-10-13 深圳市友杰智新科技有限公司 Method, device, equipment and medium for reducing voice misrecognition
CN116884399B (en) * 2023-09-06 2023-12-08 深圳市友杰智新科技有限公司 Method, device, equipment and medium for reducing voice misrecognition


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination