CN118116365A

CN118116365A - Prosodic text generation method and device applied to dialect

Info

Publication number: CN118116365A
Application number: CN202410310337.6A
Authority: CN
Inventors: 龚潇; 周振宇; 赖苏; 李婧; 袁真; 聂朝东; 谭明李
Original assignee: Guangdong Planning and Designing Institute of Telecommunications Co Ltd
Current assignee: Guangdong Planning and Designing Institute of Telecommunications Co Ltd
Priority date: 2024-03-18
Filing date: 2024-03-18
Publication date: 2024-05-31

Abstract

The invention discloses a prosodic text generating method and device applied to dialects. The method comprises: segmenting a text to be marked of a target dialect to obtain original units; determining, according to the type of the target dialect, target dialect pinyin codes matched with the target dialect; encoding each original unit based on the target dialect pinyin codes to obtain a corresponding target unit; and ordering all the target units according to the order, in the text to be marked, of the original unit corresponding to each target unit, so as to obtain the dialect prosody text. The text to be marked is thus encoded unit by unit with the target dialect pinyin codes, and the dialect prosodic features carried by those codes convert the text to be marked into a dialect prosody text that conforms to the pronunciation characteristics of the dialect. Accurate prosodic features can therefore be captured, which improves the accuracy of the generated dialect prosody text and the accuracy of dialect prosody expression during speech synthesis.

Description

Prosodic text generation method and device applied to dialect

Technical Field

The invention relates to the technical field of natural language processing, in particular to a prosodic text generating method and device applied to dialects.

Background

Prosodic text is a text representation that adds prosodic information to plain text. The prosodic information embedded in prosodic text includes the pronunciation of Chinese characters and numbers. This prosodic information is an important component of a language and greatly affects its rhythm and melody. Prosodic text has a wide range of applications, including but not limited to text-to-speech (TTS) synthesis, speech recognition, and speech emotion analysis. In speech synthesis, prosodic text can help generate more natural and expressive speech. In speech recognition, prosodic information can help improve the recognition accuracy of sentence structure and punctuation. In speech emotion analysis, prosodic information can serve as an important clue to the emotional state.

However, existing prosodic text is usually generated from the pronunciation rules of Mandarin. Because the prosodic characteristics of dialects differ greatly from those of standard languages such as Mandarin, prosodic text generated for a dialect from Mandarin data may fail to capture the accurate prosodic characteristics of the dialect, so that speech synthesized from that prosodic text cannot accurately express the prosodic features of the dialect.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a prosodic text generating method and a prosodic text generating device applied to dialects, which can generate prosodic texts conforming to prosodic characteristics of the dialects for the dialects, and improve the accuracy of prosodic expression of the dialects during speech synthesis.

In order to solve the technical problem, the first aspect of the present invention discloses a prosodic text generating method applied to dialects, the method comprising:

Segmenting a text to be marked of a target dialect according to a preset text segmentation mode to obtain all original units of the text to be marked, wherein all the original units comprise one or more of Chinese characters, pinyin character strings and digital character strings;

determining target dialect pinyin codes matched with the target dialect from a plurality of preset dialect pinyin codes according to the type of the target dialect, wherein the target dialect pinyin codes are used for representing pronunciations of all syllables in the target dialect;

for any original unit, encoding the original unit based on the target dialect pinyin encoding to obtain a target unit of the original unit;

And sequencing all the target units according to the sequencing order of the original units corresponding to each target unit in the text to be annotated, so as to obtain the dialect prosody text corresponding to the text to be annotated.

As an optional implementation manner, in the first aspect of the present invention, the prosodic text generating method applied to dialects further includes:

for any one of a plurality of sample dialects, performing phonetic analysis on all phonemes in the sample dialect and extracting basic phonetic features, wherein the plurality of sample dialects comprise the target dialect;

Determining a dialect initial consonant character string set and a dialect final sound character string set according to the basic voice characteristics, wherein the dialect initial consonant character string set comprises all dialect initial consonant character strings representing the initial part pronunciation of all syllables in the sample dialect, and the dialect final sound character string set comprises all dialect final sound character strings representing the non-initial part pronunciation of all syllables in the sample dialect;

Determining a dialect tone character set according to the basic voice features, wherein the dialect tone character set comprises all dialect tone characters representing all pronunciation tones in the sample dialect;

For any syllable of the sample dialect, determining the syllable pinyin code corresponding to the syllable from the dialect initial character string, the dialect final character string and the dialect tone character corresponding to the syllable, in a preset dialect pinyin coding order, and taking the syllable pinyin codes corresponding to all syllables of the sample dialect as the dialect pinyin codes corresponding to the sample dialect, wherein the dialect initial character string may be omitted;

and determining the dialect pinyin codes corresponding to all the sample dialects as the plurality of preset dialect pinyin codes.
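As an illustration of how a syllable pinyin code could be assembled from an initial string, a final string and a tone character, the following is a minimal sketch. The inventories, the initial-final-tone order and the function names are assumptions made for this example; the source does not specify them.

```python
from typing import Iterable, Optional, Set, Tuple

# Hypothetical inventories for one sample dialect. In the described method
# these sets would come from the phonetic analysis of that dialect.
INITIALS: Set[str] = {"b", "p", "m", "ng", "gw"}
FINALS: Set[str] = {"a", "aa", "o", "ong", "eoi"}
TONES: Set[str] = {"1", "2", "3", "4", "5", "6"}


def syllable_code(initial: Optional[str], final: str, tone: str) -> str:
    """Join initial, final and tone in a fixed order; the initial may be omitted."""
    if initial and initial not in INITIALS:
        raise ValueError(f"unknown initial: {initial!r}")
    if final not in FINALS:
        raise ValueError(f"unknown final: {final!r}")
    if tone not in TONES:
        raise ValueError(f"unknown tone: {tone!r}")
    return f"{initial or ''}{final}{tone}"


def build_dialect_coding(
    syllables: Iterable[Tuple[Optional[str], str, str]]
) -> Set[str]:
    """Collect the syllable codes of all syllables of one sample dialect."""
    return {syllable_code(i, f, t) for (i, f, t) in syllables}


# One syllable with an initial and one without.
codes = build_dialect_coding([("ng", "o", "5"), (None, "aa", "3")])
```

Repeating `build_dialect_coding` for every sample dialect would give one coding per dialect, from which the coding of the target dialect can later be selected by dialect type.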

As an optional implementation manner, in the first aspect of the present invention, the determining, according to the basic speech feature, a dialect initial string set and a dialect final string set includes:

performing phonetic analysis on all phonemes in the standard language, and extracting standard language phonetic features;

Comparing the pronunciation differences between the basic phonetic features and the standard language phonetic features to obtain a pronunciation difference comparison result; modifying the initial character strings of the standard language according to the pronunciation difference comparison result to obtain the dialect initial character string set; and modifying the final character strings of the standard language according to the pronunciation difference comparison result to obtain the dialect final character string set.
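The modification step can be pictured as applying a difference record to the standard-language inventory. The sketch below assumes the comparison result has already been reduced to lists of strings to remove and to add; that format, the function name and the sample difference are all invented for illustration, and only the 21 Hanyu Pinyin initials are real data.

```python
from typing import Dict, Iterable, Set

# The 21 initials of Hanyu Pinyin, used here as the standard-language inventory.
STANDARD_INITIALS: Set[str] = {
    "b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
    "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s",
}


def derive_dialect_set(
    standard: Set[str], diff: Dict[str, Iterable[str]]
) -> Set[str]:
    """Apply a difference record to a copy of the standard inventory."""
    dialect = set(standard)                  # leave the standard set untouched
    dialect -= set(diff.get("remove", ()))   # sounds absent from the dialect
    dialect |= set(diff.get("add", ()))      # sounds the standard language lacks
    return dialect


# Hypothetical comparison result, for illustration only.
INITIAL_DIFF = {"remove": ["zh", "ch", "sh", "r"], "add": ["ng", "gw", "kw"]}
dialect_initials = derive_dialect_set(STANDARD_INITIALS, INITIAL_DIFF)
```

The same function could be reused for the final character strings with a separate difference record.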

In a first aspect of the present invention, the encoding the original unit based on the target dialect pinyin encoding for any of the original units to obtain a target unit of the original unit includes:

When the original unit is the digital character string, determining a current application scene of the target dialect; acquiring, from the target dialect pinyin codes, the dialect digital pronunciation codes of all single digits in the current application scene and the dialect digital combination pronunciation modes of digit combinations; and judging whether the digital character string is a one-digit number. When the digital character string is judged to be a one-digit number, encoding the digital character string according to the dialect digital pronunciation code, in the current application scene, of the single digit corresponding to the digital character string, to obtain a target unit of the digital character string. When the digital character string is judged not to be a one-digit number, encoding the digital character string according to the dialect digital pronunciation codes, in the current application scene, of all the single digits included in the digital character string and the dialect digital combination pronunciation mode, in the current application scene, of the digit combination formed by the digital character string, to obtain a target unit of the digital character string. The dialect digital pronunciation code of each single digit is the same within one application scene and differs between application scenes; the dialect digital combination pronunciation mode of each digit combination is likewise the same within one application scene and differs between application scenes; a digit combination comprises at least two single digits; the category of a digit combination is determined by its context in the text to be marked; and the dialect digital pronunciation code of each single digit in the current application scene is the unique syllable code of the corresponding digit in the current application scene;

When the original unit is the Chinese character, determining a unique code capable of describing syllables corresponding to the Chinese character in the target dialect from the target dialect pinyin code as a Chinese pronunciation code of the Chinese character, and encoding the Chinese character by using the Chinese pronunciation code to obtain a target unit of the Chinese character;

When the original unit is the Pinyin character string, determining a unique code which can describe syllables corresponding to the Pinyin character string in the target dialect and corresponds to the Chinese characters corresponding to the Pinyin character string from the target dialect Pinyin code as a Pinyin pronunciation code of the Pinyin character string, and encoding the Pinyin character string by using the Pinyin pronunciation code to obtain a target unit of the Pinyin character string.
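The three cases above can be sketched as a single dispatch on the type of the original unit. Every table in this sketch is a hypothetical placeholder rather than real dialect data, the scene names `phone` and `amount` are invented, and the scene argument stands in for both the application scene and the context-derived category of the digit combination.

```python
import re
from typing import Callable, Dict, List

# Every table below is a hypothetical placeholder, not real dialect data.
DIGIT_CODES: Dict[str, Dict[str, str]] = {
    "phone": {"1": "jiu1", "2": "ji6", "3": "saam1"},
    "amount": {"1": "jat1", "2": "ji6", "3": "saam1"},
}
PLACE_CODES = ["", "sap6", "baak3"]  # placeholder codes: units, tens, hundreds
CHAR_CODES = {"我": "ngo5", "是": "si6"}
PINYIN_TO_CHAR = {"wo": "我", "shi": "是"}


def read_digit_by_digit(codes: List[str]) -> str:
    """Combination mode that reads each digit in turn."""
    return " ".join(codes)


def read_positional(codes: List[str]) -> str:
    """Combination mode that adds a place-value code after each digit.

    Simplified: zeros are not special-cased and at most three digits are handled.
    """
    if len(codes) > len(PLACE_CODES):
        raise ValueError("number too long for this sketch")
    parts = []
    for i, code in enumerate(codes):
        place = PLACE_CODES[len(codes) - 1 - i]
        parts.append(f"{code} {place}" if place else code)
    return " ".join(parts)


COMBINATION_MODES: Dict[str, Callable[[List[str]], str]] = {
    "phone": read_digit_by_digit,
    "amount": read_positional,
}


def encode_unit(unit: str, scene: str) -> str:
    """Encode one original unit into its target unit."""
    if re.fullmatch(r"[0-9]+", unit):
        digit_codes = [DIGIT_CODES[scene][d] for d in unit]
        if len(digit_codes) == 1:            # one-digit number
            return digit_codes[0]
        return COMBINATION_MODES[scene](digit_codes)
    if unit in CHAR_CODES:                   # Chinese character
        return CHAR_CODES[unit]
    if unit in PINYIN_TO_CHAR:               # pinyin string, resolved via its character
        return CHAR_CODES[PINYIN_TO_CHAR[unit]]
    raise KeyError(f"no code for unit {unit!r}")
```

In this sketch the same digit string yields different target units in different scenes, for example `encode_unit("123", "phone")` versus `encode_unit("123", "amount")`, which is the behaviour the digit case describes. Mapping a pinyin string to exactly one character is a simplification, since a real pinyin string may correspond to several characters.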

In an optional implementation manner, in the first aspect of the present invention, the original units further include punctuation characters, and before the sorting of all the target units according to the sorting order of the original units corresponding to each target unit in the text to be annotated, to obtain the dialect prosodic text corresponding to the text to be annotated, the method further includes:

determining punctuation codes corresponding to each punctuation character according to the type of the target dialect, wherein the punctuation codes are used for representing the pause duration, in the voice stream, at the position corresponding to the position of the punctuation character in the text to be marked; coding the punctuation characters according to the punctuation codes to obtain target units of the punctuation characters;

Wherein the punctuation code is determined by:

according to the type of the target dialect, predefining at least one voice pause mode, each voice pause mode corresponding to a silence of a different duration in the voice stream;

defining a unique voice pause mode code symbol for each of said voice pause modes;

And determining a voice pause mode coding symbol matched with each sample punctuation character in all sample punctuation characters of the target dialect to obtain the punctuation code, wherein all sample punctuation characters comprise the punctuation characters.
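The punctuation code can be viewed as a composition of two tables: pause modes with their coding symbols, and punctuation characters with their pause modes. The following minimal sketch uses invented durations, symbols and assignments; none of these values come from the source.

```python
from typing import Dict

# Hypothetical pause modes (silence duration in seconds) for one dialect.
PAUSE_MODES: Dict[str, float] = {"short": 0.2, "medium": 0.4, "long": 0.8}
# One coding symbol per pause mode; the symbols themselves are invented.
PAUSE_SYMBOLS: Dict[str, str] = {"short": "#1", "medium": "#2", "long": "#3"}
# Hypothetical assignment of sample punctuation characters to pause modes.
PUNCTUATION_TO_MODE: Dict[str, str] = {
    "、": "short", "，": "medium", ",": "medium",
    "。": "long", "？": "long", "！": "long",
}


def build_punctuation_code() -> Dict[str, str]:
    """Map every sample punctuation character to its pause-mode symbol."""
    if len(set(PAUSE_SYMBOLS.values())) != len(PAUSE_SYMBOLS):
        raise ValueError("pause-mode symbols must be unique")
    return {p: PAUSE_SYMBOLS[mode] for p, mode in PUNCTUATION_TO_MODE.items()}


def pause_seconds(symbol: str) -> float:
    """Look up the silence duration that a coding symbol stands for."""
    for mode, sym in PAUSE_SYMBOLS.items():
        if sym == symbol:
            return PAUSE_MODES[mode]
    raise KeyError(symbol)


punctuation_code = build_punctuation_code()
```

Encoding a punctuation character is then a dictionary lookup in `punctuation_code`, and a synthesizer could use `pause_seconds` to turn the symbol back into a silence of the intended length.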

In an optional implementation manner, in the first aspect of the present invention, before the segmenting the text to be annotated of the target dialect according to a preset text segmentation manner, the method further includes:

Determining a text cleaning mode according to the type of the target dialect, and performing text cleaning operation on the text to be marked according to the text cleaning mode, wherein the text cleaning operation comprises deleting repeated characters and/or deleting preset illegal characters; and/or

Determining a punctuation mark deleting mode according to the type of the target dialect, and performing punctuation mark deleting operation on the text to be marked according to the punctuation mark deleting mode; and/or

Judging whether other language texts with different types from the target dialect exist in the text to be marked, and if so, converting the other language texts into texts with the same type as the target dialect.
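The first two optional preprocessing operations can be sketched as follows. Reading "deleting repeated characters" as collapsing consecutive runs of selected characters is an assumption of this sketch, as are the function names and the sample character sets; the language conversion step is not covered.

```python
import re
from typing import Set


def clean_text(text: str, illegal_chars: Set[str], collapse_chars: Set[str]) -> str:
    """Drop preset illegal characters, then collapse runs of selected characters."""
    text = "".join(ch for ch in text if ch not in illegal_chars)
    if collapse_chars:
        # A character class of the selected characters, followed by repeats of
        # whichever one matched, for example a run of exclamation marks.
        char_class = "[" + re.escape("".join(sorted(collapse_chars))) + "]"
        text = re.sub("(" + char_class + r")\1+", r"\1", text)
    return text


def delete_punctuation(text: str, punctuation: Set[str]) -> str:
    """Remove the punctuation marks chosen for the target dialect."""
    return "".join(ch for ch in text if ch not in punctuation)


cleaned = clean_text("我ID是123！！！★", illegal_chars={"★"}, collapse_chars={"！"})
stripped = delete_punctuation("你好，世界。", {"，"})
```

Restricting the collapse to an explicit character set keeps digit strings such as "1123" intact, which matters because digit strings are later encoded as units.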

In an optional implementation manner, in the first aspect of the present invention, after the sorting all the target units according to the sorting order of the original units corresponding to each target unit in the text to be annotated, the method further includes:

converting the dialect prosody text into a playable dialect voice file using a predetermined dialect voice synthesis model;

The dialect voice synthesis model is obtained by inputting the dialect prosody text into a voice synthesis model corresponding to a standard language for transfer learning.

The second aspect of the present invention discloses a prosodic text generating device applied to dialects, the device comprising:

The segmentation module is used for segmenting the text to be marked of the target dialect according to a preset text segmentation mode to obtain all original units of the text to be marked, wherein all the original units comprise one or more of Chinese characters, pinyin character strings and digital character strings;

the determining module is used for determining target dialect pinyin codes matched with the target dialect from a plurality of preset dialect pinyin codes according to the type of the target dialect, and the target dialect pinyin codes are used for representing pronunciations of all syllables in the target dialect;

the coding module is used for coding any original unit based on the target dialect pinyin coding to obtain a target unit of the original unit;

And the sequencing module is used for sequencing all the target units according to the sequencing order of the original units corresponding to each target unit in the text to be annotated, so as to obtain the dialect prosody text corresponding to the text to be annotated.

As an alternative embodiment, in the second aspect of the present invention, the apparatus further includes:

the phonetic analysis module is used for carrying out phonetic analysis on all phonemes in the sample dialect for any one of a plurality of sample dialects, extracting basic phonetic features, and the plurality of sample dialects comprise the target dialect;

The dialect syllable analysis module is used for determining a dialect initial consonant character string set and a dialect final character string set according to the basic voice characteristics, wherein the dialect initial consonant character string set comprises all dialect initial consonant character strings representing the initial part pronunciation of all syllables in the sample dialect, and the dialect final character string set comprises all dialect final character strings representing the non-initial part pronunciation of all syllables in the sample dialect;

The dialect syllable analysis module is further used for determining a dialect tone character set according to the basic voice characteristics, wherein the dialect tone character set comprises all dialect tone characters representing all pronunciation tones in the sample dialect;

and the dialect syllable coding module is used for, for any syllable of the sample dialect, determining the syllable pinyin code corresponding to the syllable from the dialect initial character string, the dialect final character string and the dialect tone character corresponding to the syllable, in a preset dialect pinyin coding order, and taking the syllable pinyin codes corresponding to all syllables of the sample dialect as the dialect pinyin codes corresponding to the sample dialect, wherein the dialect initial character string may be omitted.

The determining module is further configured to, after the dialect syllable coding module determines the dialect pinyin codes corresponding to all the sample dialects, determine the dialect pinyin codes corresponding to all the sample dialects as the plurality of preset dialect pinyin codes.

In a second aspect of the present invention, the specific manner of determining the dialect initial string set and the dialect final string set by the dialect syllable parsing module according to the basic speech feature is as follows:

In a second aspect of the present invention, the specific manner of the encoding module encoding any of the original units based on the target dialect pinyin encoding to obtain the target unit of the original unit is:

When the original unit is the digital character string, determining a current application scene of the target dialect; acquiring, from the target dialect pinyin codes, the dialect digital pronunciation codes of all single digits in the current application scene and the dialect digital combination pronunciation modes of digit combinations; and judging whether the digital character string is a one-digit number. When the digital character string is judged to be a one-digit number, encoding the digital character string according to the dialect digital pronunciation code, in the current application scene, of the single digit corresponding to the digital character string, to obtain a target unit of the digital character string. When the digital character string is judged not to be a one-digit number, encoding the digital character string according to the dialect digital pronunciation codes, in the current application scene, of all the single digits included in the digital character string and the dialect digital combination pronunciation mode, in the current application scene, of the digit combination formed by the digital character string, to obtain a target unit of the digital character string. The dialect digital pronunciation code of each single digit is the same within one application scene and differs between application scenes; the dialect digital combination pronunciation mode of each digit combination is the same within one application scene and differs between application scenes; a digit combination comprises at least two single digits; the category of a digit combination is determined by its context in the text to be marked; and the dialect digital pronunciation code of each single digit in the current application scene is the unique syllable code of the corresponding digit in the target dialect pinyin codes;

As an alternative embodiment, in the second aspect of the present invention, the original unit further includes punctuation characters, and the apparatus further includes:

The second coding module is used for determining punctuation codes corresponding to each punctuation character according to the type of the target dialect before the sorting module sorts all the target units according to the sorting sequence of the original units corresponding to each target unit in the text to be marked to obtain the dialect prosody text corresponding to the text to be marked, wherein the punctuation codes are used for representing the pause time of the corresponding position of the punctuation character in the voice stream at the position of the text to be marked; coding the punctuation characters according to the punctuation codes to obtain target units of the punctuation characters;

Wherein the punctuation code is determined by:

The text cleaning module is used for determining a text cleaning mode according to the type of the target dialect before the segmentation module segments the text to be marked of the target dialect according to a preset text segmentation mode, and performing text cleaning operation on the text to be marked according to the text cleaning mode, wherein the text cleaning operation comprises deleting repeated characters and/or deleting preset illegal characters;

The punctuation mark deleting module is used for determining a punctuation mark deleting mode according to the type of the target dialect before the segmentation module segments the text to be marked of the target dialect according to a preset text segmentation mode, and carrying out punctuation mark deleting operation on the text to be marked according to the punctuation mark deleting mode;

the language conversion module is used for judging whether other language texts with different types from the target dialect exist in the text to be marked before the text to be marked of the target dialect is segmented by the segmentation module according to a preset text segmentation mode, and if the other language texts exist, converting the other language texts into texts with the same type as the target dialect.

The voice synthesis module is used for converting the dialect prosody text into a playable dialect voice file using a predetermined dialect voice synthesis model, after the sequencing module sorts all the target units according to the sorting order of the original units corresponding to each target unit in the text to be marked to obtain the dialect prosody text corresponding to the text to be marked;

In a third aspect, the present invention discloses another prosodic text generating device applied to dialects, the device comprising:

A memory storing executable program code;

a processor coupled to the memory;

The processor invokes the executable program code stored in the memory to perform some or all of the steps of the prosodic text generating method for dialects disclosed in the first aspect of the invention.

The fourth aspect of the present invention discloses a computer-readable storage medium storing computer instructions for executing some or all of the steps of the prosodic text generating method for dialects disclosed in the first aspect of the present invention when the computer instructions are called.

Compared with the prior art, the embodiment of the invention has the following beneficial effects:

In the embodiment of the invention, the text to be marked of the target dialect is segmented according to a preset text segmentation mode to obtain all original units of the text to be marked, where all original units comprise one or more of Chinese characters, pinyin character strings and digital character strings; target dialect pinyin codes matched with the target dialect are determined from a plurality of preset dialect pinyin codes according to the type of the target dialect, the target dialect pinyin codes being used for representing the pronunciations of all syllables in the target dialect; each original unit is encoded based on the target dialect pinyin codes to obtain a target unit of that original unit; and all the target units are ordered according to the order, in the text to be marked, of the original unit corresponding to each target unit, so as to obtain the dialect prosody text corresponding to the text to be marked. The invention thus splits the text to be marked into original units, determines the target dialect pinyin codes according to the type of the target dialect, and encodes the text to be marked unit by unit with those codes. The prosodic features of the target dialect carried by the target dialect pinyin codes convert the text to be marked into a dialect prosody text that conforms to the pronunciation characteristics of the dialect, so accurate prosodic features can be captured. This improves the accuracy of the generated dialect prosody text and helps synthesize speech that accurately expresses the prosodic characteristics of the dialect.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a prosodic text generating method applied to dialects according to an embodiment of the present invention;

Fig. 2 is a flowchart of another prosodic text generating method applied to dialects disclosed in an embodiment of the invention;

Fig. 3 is a schematic structural diagram of a prosodic text generating device applied to dialects according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of another prosodic text generating device applied to dialects according to the embodiment of the disclosure;

Fig. 5 is a schematic structural diagram of still another prosodic text generating device applied to dialects according to an embodiment of the present invention.

Detailed Description

In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.

The terms "first", "second" and the like in the description, in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have", as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, or article that comprises a list of steps or elements is not limited to the listed steps or elements, but may optionally include other steps or elements that are not listed or that are inherent to such a process, method, apparatus, or article.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.

The prosodic text generating method applied to dialects can generate prosodic text that reflects the pronunciation characteristics of a dialect, and further voice processing can be performed based on that prosodic text. This includes, but is not limited to: using the prosodic text to generate synthetic voice with the pronunciation characteristics of the dialect, for use in navigation, voice assistant and other systems; applying the prosodic text to a dialect recognition and conversion system that recognizes the type of a dialect and converts the dialect into standard-language voice; and applying the prosodic text to a voice control system so that dialect-based voice control commands are accurately recognized. The embodiment of the invention does not limit the application of the prosodic text. The present invention will be described in detail with reference to the following examples.

Example 1

Referring to Fig. 1, Fig. 1 is a flowchart of a prosodic text generating method applied to dialects according to an embodiment of the present invention. The method shown in Fig. 1 may be applied, by a prosodic text generating device applied to dialects, to application scenarios of various Chinese dialects, for example the Northeastern, Chongqing and Guangdong dialects, which is not limited by the embodiments of the present invention. As shown in Fig. 1, the prosodic text generating method applied to dialects may include the following operations:

101. Segmenting the text to be marked of the target dialect according to a preset text segmentation mode to obtain all original units of the text to be marked, wherein all original units comprise one or more of Chinese characters, pinyin character strings and digital character strings.

The text segmentation mode in this embodiment may refer to splitting all Chinese characters in the text to be marked into individual Chinese characters, splitting all digits in the text to be marked into digital character strings, each being a run of consecutive digit characters, and splitting all pinyin letters in the text to be marked into pinyin character strings, each being a run of consecutive pinyin characters. For example, splitting a piece of text meaning "my ID is 123, her ID is 456" yields, in order, the original units "I", "ID", "is", "123", "she", "ID", "is", "456", where "I", "is" and "she" each stand for a single Chinese character.

102. Determining target dialect pinyin codes matched with the target dialect from a plurality of preset dialect pinyin codes according to the type of the target dialect, wherein the target dialect pinyin codes are used for representing the pronunciation of all syllables in the target dialect.

103. For any original unit, encoding the original unit based on the target dialect pinyin codes to obtain a target unit of the original unit.

104. Sequencing all the target units according to the sequencing order of the original units corresponding to each target unit in the text to be marked, so as to obtain the dialect prosody text corresponding to the text to be marked.

As can be seen, the method described in Fig. 1 splits the text to be marked into original units, determines the target dialect pinyin codes according to the type of the target dialect, and encodes the text to be marked unit by unit with those codes. The dialect prosodic features carried by the target dialect pinyin codes convert the text to be marked into a dialect prosody text that conforms to the pronunciation characteristics of the dialect, so accurate prosodic features can be captured. This improves the accuracy of the generated dialect prosody text and helps synthesize speech that accurately expresses the prosodic characteristics of the dialect.
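Operations 101 to 104 can be sketched end to end as follows. The regular expression, the function names and the space separator are assumptions of this sketch, the Chinese sample sentence is an assumed rendering of the example sentence, and the encoder is passed in as a parameter so that any coding for the target dialect can be plugged in. Punctuation is simply dropped here, although the method optionally treats it as a further unit type.

```python
import re
from typing import Callable, List

# One alternative per unit type: a single Chinese character, a run of
# ASCII letters (pinyin character string), or a run of digits.
UNIT_PATTERN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+|[0-9]+")


def segment(text: str) -> List[str]:
    """Operation 101: split the text into original units, in text order."""
    return UNIT_PATTERN.findall(text)


def generate_prosodic_text(
    text: str, encode: Callable[[str], str], sep: str = " "
) -> str:
    """Operations 103 and 104: encode each unit, then order by original position."""
    units = segment(text)
    indexed = [(i, encode(unit)) for i, unit in enumerate(units)]
    indexed.sort(key=lambda pair: pair[0])   # explicit ordering step
    return sep.join(target for _, target in indexed)


# Assumed Chinese rendering of the example sentence.
sample = "我ID是123，她ID是456"
units = segment(sample)
# A toy encoder that only wraps each unit, to make the ordering visible.
prosodic = generate_prosodic_text(sample, lambda unit: f"<{unit}>")
```

Operation 102, selecting the coding for the target dialect, corresponds to choosing which `encode` function is passed in.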

In an alternative embodiment, the prosodic text generating method applied to the dialect may further include:

for any one of a plurality of sample dialects, performing phonetic analysis on all phonemes in the sample dialect and extracting basic phonetic features, wherein the plurality of sample dialects comprise the target dialect;

According to the basic voice characteristics, determining a dialect initial consonant character string set and a dialect final sound character string set, wherein the dialect initial consonant character string set comprises all dialect initial consonant character strings representing pronunciation of all syllable starting parts in a sample dialect, and the dialect final sound character string set comprises all dialect final sound character strings representing pronunciation of all syllable non-starting parts in the sample dialect;

Determining a dialect tone character set according to the basic voice characteristics, wherein the dialect tone character set comprises all dialect tone characters representing all pronunciation tones in the sample dialect;

For any syllable of the sample dialect, determining, according to a preset dialect pinyin coding order, the syllable pinyin code corresponding to the syllable from the dialect initial character string, the dialect final character string and the dialect tone character corresponding to the syllable, and taking the syllable pinyin codes corresponding to all syllables of the sample dialect as the dialect pinyin codes corresponding to the sample dialect, wherein the dialect initial character string is a character string that may be omitted;

Determining the dialect pinyin codes corresponding to all the sample dialects as the plurality of predetermined dialect pinyin codes.

Here, a phoneme is the smallest pronunciation unit in a language, that is, the smallest unit or smallest speech segment constituting a syllable, and the smallest linear speech unit divided from the perspective of sound quality. A syllable is a sound unit produced continuously by the airflow passing through the vocal organs during speech; it is also the smallest speech unit formed by combining vowel phonemes and consonant phonemes in a language, and a single vowel phoneme may also form a syllable. The basic speech features refer to the acoustic and physiological characteristics formed during the pronunciation of phonemes, specifically including pitch (intonation), duration, and sound quality (including the openness, front-back position and tongue position of vowels, and the voicing, place of articulation and manner of articulation of consonants), which together define the unique pronunciation properties of a syllable. The syllable-initial pronunciation refers to the initial consonant or consonant combination of a syllable; the non-initial pronunciation of a syllable refers to the vowels and subsequent consonants that follow the syllable-initial pronunciation; tone refers to the pitch variation of a syllable or phoneme; and phonetic analysis is the systematic study of the acoustic and articulatory characteristics of speech, including how phonemes are produced and their audio characteristics.
Taking the Chinese character "horse" as an example: the pronunciation corresponding to the Mandarin pinyin "ma" is a syllable; the pronunciation corresponding to the Mandarin pinyin "m" is the syllable-initial pronunciation; the pronunciation corresponding to the Mandarin pinyin "a" is the non-initial pronunciation of the syllable; the pronunciations corresponding to "m" and "a" are each separate phonemes; and the tone of the character "horse" is a falling-rising tone, that is, the tone starts at a medium pitch, falls, and then rises again.

After the dialect pinyin codes corresponding to all the sample dialects are determined, the operation of determining, according to the type of the target dialect, the target dialect pinyin codes matching the target dialect from the plurality of predetermined dialect pinyin codes is triggered.

Therefore, in this alternative embodiment, the dialect initial character string set and the dialect final character string set are determined from the basic speech features, so that the dialect pinyin codes corresponding to the various sample dialects can be generated. This yields accurate target dialect pinyin codes that accurately represent the prosodic features of the dialect, improves the accuracy of generating the prosodic text corresponding to the text to be marked, and helps preserve and popularize the dialect.
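The composition of syllable pinyin codes and the later selection of the target dialect pinyin codes can be sketched as follows. The coding order (initial, final, tone), the use of a digit as the tone character, the registry keyed by dialect type and all function names are assumptions made for illustration; the sample syllables are placeholders rather than an actual dialect inventory.

```python
def syllable_code(initial, final, tone, order=("initial", "final", "tone")):
    """Concatenate the parts of one syllable in a preset coding order.

    The initial may be omitted (None or an empty string).
    """
    parts = {"initial": initial or "", "final": final, "tone": str(tone)}
    return "".join(parts[name] for name in order)


def build_dialect_codes(syllables):
    """Map every (initial, final, tone) triple of a sample dialect to its code."""
    return {s: syllable_code(*s) for s in syllables}


def select_target_codes(registry, dialect_type):
    """Pick the predetermined dialect pinyin codes matching the target dialect."""
    return registry[dialect_type]


# Hypothetical registry holding one placeholder sample dialect.
registry = {"sample_dialect": build_dialect_codes([("m", "a", 3), (None, "e", 4)])}
target_codes = select_target_codes(registry, "sample_dialect")
```

Keeping the registry keyed by dialect type means that adding a new sample dialect only requires adding one more entry.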

In this optional embodiment, further optionally, determining the set of dialect initial strings and the set of dialect final strings according to the basic speech feature includes:

Here, the standard language is a language form that is widely recognized and used in a society or cultural group, and is usually standardized and promoted by an official institution. In this embodiment, Mandarin Chinese is the standard language.

In this embodiment, the Chongqing dialect is taken as an example to explain how its target dialect pinyin codes are determined. The pronunciation of the Chongqing dialect is compared with that of Mandarin. Phonemes that do not exist in the Chongqing dialect are deleted, such as the retroflex consonants zh, ch and sh and the final ing in Mandarin; phonemes specific to the Chongqing dialect are added, such as the non-initial pronunciation of the syllable corresponding to "Chinese" in the Chongqing word "old man". The initials of Mandarin are modified to obtain the initial character string set of the Chongqing dialect:

aa、b、c、d、ee、f、g、h、ii、j、k、l、m、oo、p、q、r、s、t、uu、vv、x、z。

Here, the initials aa, ee, ii, oo, uu and vv are pseudo-initials corresponding to the vowels a, e, i, o, u and v respectively; a pseudo-initial indicates that no sound is produced in the speech stream. When a syllable has no initial, the pseudo-initial corresponding to the vowel included in the syllable is used as the initial of the syllable. For example, the syllable of the Chinese character "hungry" has no initial, so the initial in its dialect pinyin code is the pseudo-initial ee corresponding to the vowel e. Introducing pseudo-initials into the initial character string set ensures that the dialect pinyin code of every syllable includes an initial, which simplifies the encoding of original units, avoids errors introduced by an overly complex encoding scheme, and improves the accuracy of prosodic text generation.

The finals of Mandarin are modified to obtain the final character string set of the Chongqing dialect:

a、ai、an、ang、ao、e、ei、en、er、i、iu、iy、o、ong、ou、u、ueng、ui、un、v、van、ve、vn、ng、uong、ar、uar、ier、iar、uer。

Here, the final ar is pronounced as the non-initial part of the syllable corresponding to "Chinese" in the Chongqing word "old man"; the final iy is pronounced as the non-initial part of the syllable corresponding to "eating" in the Chongqing dialect; the final ueng is pronounced as the non-initial part of the syllable corresponding to "about" in the Chongqing dialect; the final van is pronounced as the non-initial part of the syllable corresponding to "bowl" in the Chongqing dialect; the final ng is pronounced as the non-initial part of the syllable corresponding to "hard" in the Chongqing dialect; and the final uong is pronounced as the non-initial part of the syllable corresponding to "with" in the Chongqing dialect.

The tone types of the Chongqing dialect are the same as those of Mandarin. However, some Chinese characters have different tones in the Chongqing dialect and in Mandarin, so the original unit corresponding to such a Chinese character should be encoded according to the tone of the character in the Chongqing dialect.
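The pseudo-initial rule described above can be sketched as follows. The initial list is copied from the set given above; the rule of taking the first letter of the final, the output layout "initial final+tone" and the tone digits are assumptions of this sketch, not the encoding format defined by the embodiment.

```python
CHONGQING_INITIALS = [
    "aa", "b", "c", "d", "ee", "f", "g", "h", "ii", "j", "k", "l",
    "m", "oo", "p", "q", "r", "s", "t", "uu", "vv", "x", "z",
]

# Pseudo-initial for each vowel, as listed in the description.
PSEUDO_INITIALS = {"a": "aa", "e": "ee", "i": "ii", "o": "oo", "u": "uu", "v": "vv"}


def resolve_initial(initial, final):
    """Return the given initial, or the pseudo-initial when it is absent."""
    if initial:
        return initial
    first_letter = final[0]  # assumption: the first letter of the final decides
    if first_letter not in PSEUDO_INITIALS:
        raise ValueError(f"no pseudo-initial defined for final {final!r}")
    return PSEUDO_INITIALS[first_letter]


def encode_syllable(initial, final, tone):
    """Build a code in which every syllable carries an initial (layout assumed)."""
    return f"{resolve_initial(initial, final)} {final}{tone}"
```

With this rule, a syllable consisting only of the final e is encoded with the pseudo-initial ee, so every code has the same three-part shape and downstream parsing needs no special case for missing initials.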

Therefore, in this further alternative embodiment, by performing phonetic analysis on all phonemes of the standard language, extracting the phonetic features of the standard language, comparing the phonetic features with the basic phonetic features of the target dialect, and modifying the initial consonant character string and the final character string of the standard language according to the comparison result, an accurate dialect initial consonant character string set and an accurate dialect final character string set are obtained, so that the accuracy of determining the pinyin coding of the target dialect can be improved, the accuracy of generating the prosodic text of the target dialect can be improved, the development time of the prosodic text generating device applied to the dialect can be reduced, and the development resources can be saved.

In another alternative embodiment, for any original unit, encoding the original unit based on target dialect pinyin encoding to obtain a target unit of the original unit includes:

When the original unit is a digital character string, determining the current application scenario of the target dialect, obtaining from the target dialect pinyin codes the dialect digit pronunciation codes of all single digits in the current application scenario and the dialect digit-combination pronunciation modes of digit combinations, and judging whether the digital character string is a single digit; when the digital character string is judged to be a single digit, encoding the digital character string according to the dialect digit pronunciation code corresponding, in the current application scenario, to that single digit, to obtain a target unit of the digital character string; when the digital character string is judged not to be a single digit, encoding the digital character string according to the dialect digit pronunciation codes corresponding, in the current application scenario, to all the single digits included in the digital character string, and according to the dialect digit-combination pronunciation mode corresponding, in the current application scenario, to the digit combination formed by the digital character string, to obtain a target unit of the digital character string. The dialect digit pronunciation code of each digit is the same within the same application scenario and differs between different application scenarios; the dialect digit-combination pronunciation mode of each digit combination is the same within the same application scenario and differs between different application scenarios. A digit combination includes at least two single digits, the category of a digit combination is determined by its context in the text to be marked, and the dialect digit pronunciation code is the unique code, in the target dialect, of the syllable corresponding to the single digit in the current application scenario;

When the original unit is a Chinese character, determining a unique code capable of describing syllables corresponding to the Chinese character in a target dialect from target dialect pinyin codes as Chinese pronunciation codes of the Chinese character, and coding the Chinese character by using the Chinese pronunciation codes to obtain a target unit of the Chinese character;

when the original unit is a pinyin character string, determining a unique code which can describe syllables corresponding to Chinese characters corresponding to the pinyin character string in a target dialect from the pinyin codes of the target dialect as the pinyin pronunciation codes of the pinyin character string, and using the pinyin pronunciation codes to code the pinyin character string to obtain the target unit of the pinyin character string.

In this alternative embodiment, single digits refer to the digits 0-9, and a single digit may be pronounced differently in different application scenarios. For example, in a telephone number the digit 1 is often pronounced like "Yao" in Mandarin and the digit 0 is often pronounced like "hole" in Mandarin, while the digit 2 is often pronounced "two" before a measure word. Different types of digit combination may also be pronounced differently. For example, the digit combination 1200 of the telephone-number type is pronounced digit by digit, whereas the digit combination 1200 before a measure word is pronounced "one thousand two hundred" in some application scenarios and "one thousand two" in others.

In this alternative embodiment, the digit pronunciation codes of single digits in different application scenarios and the pronunciation modes of different types of digit combination in different application scenarios are determined, and a digital character string is encoded according to the current application scenario and the number of digits in the string, which improves the encoding accuracy of digital character strings across application scenarios. Chinese characters are encoded based on the accurate target dialect pinyin codes and the syllables corresponding to the Chinese characters in the target dialect, which accurately reflects their pronunciation in the target dialect and improves their encoding accuracy. Pinyin character strings are encoded based on the accurate target dialect pinyin codes and the syllables corresponding to the pinyin character strings in the target dialect, which accurately reflects the pronunciation, in the target dialect, of the Chinese characters corresponding to the pinyin character strings, improves the encoding accuracy of pinyin character strings, and thus improves the accuracy of generating the dialect prosodic text.
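The scenario-dependent encoding of digital character strings can be sketched as follows. The two scenario names, the per-digit tables and the positional reading are all assumptions of the sketch; the codes are Mandarin-style placeholders rather than codes of any real dialect, and the positional reader is deliberately simplified to at most four digits.

```python
# Hypothetical per-scenario digit pronunciation codes (Mandarin-style placeholders).
BASE = {"0": "ling2", "1": "yi1", "2": "er4", "3": "san1", "4": "si4",
        "5": "wu3", "6": "liu4", "7": "qi1", "8": "ba1", "9": "jiu3"}
DIGIT_CODES = {
    "phone": {**BASE, "0": "dong4", "1": "yao1"},
    "quantity": dict(BASE),
}
UNITS = ["", "shi2", "bai3", "qian1"]  # ones, tens, hundreds, thousands


def read_digit_by_digit(digits, codes):
    """Combination mode that reads each digit separately."""
    return [codes[d] for d in digits]


def read_positional(digits, codes):
    """Simplified positional reading; trailing zeros are dropped."""
    if len(digits) > len(UNITS):
        raise ValueError("this sketch only covers up to four digits")
    out, pending_zero = [], False
    for idx, d in enumerate(digits):
        power = len(digits) - 1 - idx
        if d == "0":
            pending_zero = bool(out)  # remember an inner zero
            continue
        if pending_zero:
            out.append(codes["0"])
            pending_zero = False
        out.append(codes[d])
        if UNITS[power]:
            out.append(UNITS[power])
    return out


COMBINATION_MODES = {"phone": read_digit_by_digit, "quantity": read_positional}


def encode_digit_string(digits, scenario):
    """Encode one digital character string for the current application scenario."""
    codes = DIGIT_CODES[scenario]
    if len(digits) == 1:  # single digit: use its code directly
        return codes[digits]
    return " ".join(COMBINATION_MODES[scenario](digits, codes))
```

In this sketch the same string 1200 produces different target units depending on the scenario, which mirrors the distinction between per-digit codes and combination pronunciation modes.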

In yet another optional embodiment, the original units further include punctuation characters, and before all the target units are ordered according to the order, in the text to be marked, of the original unit corresponding to each target unit, the prosodic text generating method applied to the dialect further includes:

Determining, according to the type of the target dialect, the punctuation code corresponding to each punctuation character, wherein a punctuation code represents the pause duration, in the speech stream, at the position corresponding to the position of the punctuation character in the text to be marked; encoding the punctuation characters according to the punctuation codes to obtain target units of the punctuation characters; and triggering the execution of the operation of ordering all the target units according to the order, in the text to be marked, of the original unit corresponding to each target unit, so as to obtain the dialect prosodic text corresponding to the text to be marked;

Wherein punctuation coding is determined by:

defining a unique voice pause mode code symbol for each voice pause mode;

Determining, for each of all sample punctuation characters of the target dialect, the matching speech pause mode code symbol, to obtain the punctuation codes, wherein all the sample punctuation characters include the punctuation characters.

In this alternative embodiment, each punctuation mark is matched with a corresponding speech pause mode, so that different pause durations are allocated in the speech stream. The prosodic text can thus represent the rhythm and pauses of human pronunciation, the synthesized speech better matches the prosody of the target dialect, and the accuracy of generating the dialect prosodic text is improved.
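The two-step construction of punctuation codes can be sketched as follows. The pause mode names, the code symbols "#1" to "#3" and the assignment of punctuation marks to pause modes are hypothetical values chosen for illustration.

```python
# Step 1: one unique code symbol per speech pause mode (symbols are assumptions).
PAUSE_MODE_SYMBOLS = {"short": "#1", "medium": "#2", "long": "#3"}

# Step 2: match every sample punctuation character with a pause mode (assumed).
PUNCTUATION_PAUSE_MODES = {
    ",": "short", "、": "short",
    ";": "medium", ":": "medium",
    "。": "long", "?": "long", "!": "long",
}


def build_punctuation_codes(pause_modes, symbols):
    """Return the punctuation codes: punctuation character -> pause code symbol."""
    return {char: symbols[mode] for char, mode in pause_modes.items()}


def encode_punctuation(char, punctuation_codes):
    """Encode one punctuation character into its target unit."""
    return punctuation_codes[char]


PUNCTUATION_CODES = build_punctuation_codes(PUNCTUATION_PAUSE_MODES, PAUSE_MODE_SYMBOLS)
```

Separating the symbol table from the punctuation table lets a different dialect reuse the same pause symbols while assigning its punctuation marks to other pause modes.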

In yet another optional embodiment, before the text to be annotated of the target dialect is segmented according to the preset text segmentation mode, the prosodic text generating method applied to the dialect further includes:

Determining a text cleaning mode according to the type of the target dialect, performing text cleaning operation on the text to be marked according to the text cleaning mode, and triggering and executing the operation of cutting the text to be marked of the target dialect according to a preset text cutting mode, wherein the text cleaning operation comprises deleting repeated characters and/or deleting preset illegal characters.

An illegal character is a character that cannot be identified as any of a Chinese character, a pinyin character string or a digital character string when the text to be marked of the target dialect is segmented according to the preset text segmentation manner.

It can be seen that this alternative embodiment reduces noise in the prosodic text, improves the accuracy of generating the prosodic text, and improves the efficiency of generating the prosodic text by deleting repeated and/or illegal characters.
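A text cleaning operation of this kind can be sketched as follows. The description does not define what counts as a repeated character, so the sketch assumes that only runs longer than a configurable length `max_run` are shortened; the function name, that parameter and the set of illegal characters in the usage note are hypothetical.

```python
import re


def clean_text(text, illegal_chars, max_run=2):
    """Delete preset illegal characters, then shorten long runs of one character.

    A run of the same character longer than `max_run` is cut down to
    `max_run` characters (an assumed interpretation of "repeated characters").
    """
    kept = "".join(ch for ch in text if ch not in illegal_chars)
    # (.) followed by at least `max_run` further copies is a run that is too long.
    run_pattern = re.compile(r"(.)\1{%d,}" % max_run, flags=re.DOTALL)
    return run_pattern.sub(lambda m: m.group(1) * max_run, kept)
```

For example, `clean_text("好好好好的###", {"#"})` first drops the three "#" characters and then shortens the run of four identical characters to two.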

Determining a punctuation mark deleting mode according to the type of the target dialect, performing punctuation mark deleting operation on the text to be marked according to the punctuation mark deleting mode, and triggering and executing the operation of cutting the text to be marked of the target dialect according to a preset text cutting mode.

The punctuation mark deleting mode may be determined according to the type of the target dialect as follows: the types of punctuation mark that carry no pause duration are determined, according to the type of the target dialect, as the types of punctuation mark to be deleted, and all punctuation marks of those types in the text to be marked are deleted one by one or in batch.

Therefore, the optional embodiment eliminates the interference caused by irrelevant punctuation marks by deleting the punctuation marks which do not bear the pause time, improves the generating accuracy of the dialect prosodic text, improves the generating efficiency of the dialect prosodic text, and simplifies the structure of the prosodic text.
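Batch deletion of punctuation marks that carry no pause duration can be sketched as follows. Which punctuation marks fall into that category depends on the target dialect; the book title marks used in the usage note are an assumed example, and the function name is hypothetical.

```python
def delete_punctuation(text, marks_to_delete):
    """Delete, in one pass, every punctuation mark of the types to be deleted."""
    # The third argument of str.maketrans lists characters mapped to None.
    table = str.maketrans("", "", "".join(marks_to_delete))
    return text.translate(table)
```

For example, `delete_punctuation("我看《红楼梦》,很好。", "《》")` removes the title marks while keeping the comma and the full stop, which still carry pauses.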

Judging whether other language texts with different types from the target dialect exist in the text to be marked, if so, converting the other language texts into texts with the same type as the target dialect, and triggering and executing the operation of cutting the text to be marked of the target dialect according to a preset text cutting mode.

Therefore, according to the alternative embodiment, through converting other language texts into texts with the same type as the target dialect, the language types of all texts in the text to be marked are ensured to be the same as the target dialect, the generating accuracy of the dialect prosody text is improved, and the accuracy of the dialect prosody expression during speech synthesis is further improved.

In yet another alternative embodiment, after encoding the original unit based on the pinyin encoding of the target dialect, the prosodic text generating method applied to the dialect further includes:

Adding a corresponding extension sound code for each target unit that is predetermined to require an extended pronunciation, wherein the extension sound code corresponding to a target unit indicates that the pronunciation of the non-initial part of the syllable of the target dialect corresponding to the target unit needs to be extended in the speech stream.

The extension sound code may be placed at the position immediately after the target unit, or at the position immediately before the target unit, which is not limited by the embodiment of the present invention. The target unit whose pronunciation needs to be extended may be any of a target unit corresponding to a Chinese character, a target unit corresponding to a pinyin character string, and a target unit corresponding to a digital character string, which is not limited by the embodiment of the present invention.

It can be seen that, by adding the extension sound code, this alternative embodiment enables the prosodic text to represent the natural feature of human speech in which the non-initial part of certain syllables is lengthened in a specific context, thereby improving the accuracy of generating the dialect prosodic text and further enhancing the accuracy and naturalness of speech synthesis.

Adding a corresponding stress code for each target unit that is predetermined to be stressed, wherein the stress code corresponding to a target unit indicates that the volume of the syllable of the target dialect corresponding to the target unit needs to be increased or decreased in the speech stream.

The stress code may be placed at the position immediately after the target unit, or at the position immediately before the target unit, which is not limited by the embodiment of the present invention. The target unit to which a stress code is added may be any of a target unit corresponding to a Chinese character, a target unit corresponding to a pinyin character string, and a target unit corresponding to a digital character string, which is not limited by the embodiment of the present invention.

After the corresponding actions have been performed on all the target units, the operation of ordering all the target units according to the order, in the text to be marked, of the original unit corresponding to each target unit, so as to obtain the dialect prosodic text corresponding to the text to be marked, is triggered, wherein the corresponding actions specifically refer to adding extension sound codes and/or stress codes to the target units.

Therefore, by adding stress codes, this optional embodiment enables the prosodic text to represent emphasis and weakening in natural language, thereby improving the accuracy of generating the dialect prosodic text and further enhancing the emotional expression of the synthesized speech. In addition, the prosodic text clearly shows which syllables need to be louder or quieter in the speech stream, which helps the listener capture key information and improves the understanding of the speech content.
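Attaching the two kinds of additional code to already encoded target units can be sketched as follows. The symbols "~", "^" and "_", the choice to place the volume-related code before the unit and the extension sound code after it, and the index-based selection of units are all assumptions of this sketch; the embodiment leaves the placement open.

```python
EXTENSION_CODE = "~"      # hypothetical symbol: lengthen the non-initial part
VOLUME_UP_CODE = "^"      # hypothetical symbol: increase the volume
VOLUME_DOWN_CODE = "_"    # hypothetical symbol: decrease the volume


def annotate_units(target_units, extend=(), volume=None):
    """Return the target units with additional codes inserted around them.

    `extend` holds the indexes of units to lengthen; `volume` maps a unit
    index to "up" or "down". The order of the units themselves is unchanged.
    """
    volume = volume or {}
    annotated = []
    for index, unit in enumerate(target_units):
        if index in volume:
            code = VOLUME_UP_CODE if volume[index] == "up" else VOLUME_DOWN_CODE
            annotated.append(code)            # placed before the unit
        annotated.append(unit)
        if index in extend:
            annotated.append(EXTENSION_CODE)  # placed after the unit
    return annotated
```

Because the codes are inserted next to their unit, the subsequent ordering step can treat each unit together with its codes as one block.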

Replacing predetermined text that requires dialect synonym replacement with dialect text having the same meaning, and triggering the execution of the step of segmenting the text to be marked of the target dialect according to the preset text segmentation manner.

In this embodiment, a dialect synonym is the commonly used dialect alternative to a word in the standard language. Taking the Chongqing dialect as an example, the standard-language word "thinking" is usually replaced by a Chongqing word pronounced "Mie", which is the commonly used Chongqing alternative to "thinking".

Therefore, in this alternative embodiment, the text to be replaced is replaced with a dialect synonym, which better simulates the real pronunciation habits of dialect users, improves the accuracy of generating the dialect prosodic text, and in turn helps the synthesized speech express the prosodic features of the dialect accurately. Because the generated prosodic text follows the expression habits of the dialect, the naturalness and local color of the speech synthesis output are improved, the synthesized speech sounds more fitting and realistic, and the listening experience of dialect users is improved.
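Dialect synonym replacement before segmentation can be sketched as a table lookup. The function name and the single table entry in the usage note are illustrative assumptions and not part of the embodiment; a real table would be predetermined for each target dialect.

```python
import re


def replace_synonyms(text, synonym_table):
    """Replace standard-language words with dialect words of the same meaning.

    All keys are matched in a single pass, longest key first, so that a
    replacement is never replaced again.
    """
    if not synonym_table:
        return text
    keys = sorted(synonym_table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: synonym_table[match.group(0)], text)
```

For example, with the illustrative table `{"什么": "啥子"}`, the text "你说什么" becomes "你说啥子" before it is segmented into original units.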

Example II

Referring to fig. 2, fig. 2 is a flowchart of another prosodic text generating method applied to dialects according to an embodiment of the present invention. The method shown in fig. 2 may be applied by a prosodic text generating device applied to dialects in application scenarios of various Chinese dialects, for example the Northeastern, Chongqing and Guangdong dialects, which is not limited by the embodiments of the present invention. As shown in fig. 2, the prosodic text generating method applied to dialects may include the following operations:

201. Segmenting the text to be marked of the target dialect according to a preset text segmentation manner to obtain all original units of the text to be marked, wherein all the original units include one or more of Chinese characters, pinyin character strings and digital character strings.

202. Determining, according to the type of the target dialect, the target dialect pinyin codes matching the target dialect from a plurality of predetermined dialect pinyin codes, wherein the target dialect pinyin codes are used to represent the pronunciation of all syllables in the target dialect.

203. For any original unit, encoding the original unit based on the target dialect pinyin codes to obtain a target unit of the original unit.

204. Ordering all the target units according to the order, in the text to be marked, of the original unit corresponding to each target unit, so as to obtain the dialect prosodic text corresponding to the text to be marked.

205. Converting the dialect prosodic text into a playable dialect speech file by using a predetermined dialect speech synthesis model;

wherein the dialect speech synthesis model is obtained by inputting dialect prosodic text into a speech synthesis model corresponding to a standard language for transfer learning.

In the embodiment of the present invention, for detailed descriptions of steps 201-204, refer to the descriptions of steps 101-104 in the first embodiment, which are not repeated here.

It can be seen that the method described in the second embodiment segments the text to be marked into original units, determines the target dialect pinyin codes according to the type of the target dialect, and uses the target dialect pinyin codes to encode the text to be marked unit by unit, in order. Through the dialect prosodic features carried by the target dialect pinyin codes, the text to be marked is converted into dialect prosodic text that conforms to the pronunciation characteristics of the dialect, and accurate prosodic features can be captured, which improves the accuracy of generating the dialect prosodic text and helps the synthesized speech express the prosodic features of the dialect accurately. In addition, the method described in the second embodiment takes the existing speech synthesis model corresponding to the standard language as a basis and adapts it to the target dialect through transfer learning, which reduces the training time of the speech synthesis model of the target dialect and improves the efficiency of dialect speech synthesis. Furthermore, transfer learning allows the model to inherit the generalization capability of the speech synthesis model corresponding to the standard language, including its understanding of the basic characteristics of speech, so that the target dialect speech synthesis model to be trained adapts to new dialect data more quickly, further improving the efficiency of dialect speech synthesis.

Example III

Referring to fig. 3, fig. 3 is a schematic structural diagram of a prosodic text generating device applied to dialects according to an embodiment of the present invention. The device shown in fig. 3 may be applied in application scenarios of various Chinese dialects, for example the Northeastern, Chongqing and Guangdong dialects, which is not limited by the embodiment of the present invention.

The apparatus as shown in fig. 3 includes:

The segmentation module 301 is configured to segment a text to be marked of a target dialect according to a preset text segmentation manner, so as to obtain all original units of the text to be marked, where all the original units include one or more of Chinese characters, pinyin character strings and digital character strings;

The determining module 302 is configured to determine, according to the type of the target dialect, a target dialect pinyin code that matches the target dialect from among a plurality of dialect pinyin codes that are determined in advance, where the target dialect pinyin code is used to represent pronunciations of all syllables in the target dialect;

The encoding module 303 is configured to encode any original unit based on target dialect pinyin encoding to obtain a target unit of the original unit;

And the sorting module 304 is configured to sort all the target units according to the sorting order of the original units corresponding to each target unit in the text to be annotated, so as to obtain the dialect prosody text corresponding to the text to be annotated.

As can be seen, the device described in fig. 3 segments the text to be marked into original units, determines the target dialect pinyin codes according to the type of the target dialect, and uses the target dialect pinyin codes to encode the text to be marked unit by unit, in order. Through the dialect prosodic features carried by the target dialect pinyin codes, the text to be marked is converted into dialect prosodic text that conforms to the pronunciation characteristics of the dialect, and accurate prosodic features can be captured, which improves the accuracy of generating the dialect prosodic text and helps the synthesized speech express the prosodic features of the dialect accurately.
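The cooperation of the four modules can be sketched as one small class. The class name, the injected `segment` callable, the plain lookup table standing in for the dialect pinyin codes and the space-separated output are assumptions of this sketch, not the structure of the device itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProsodicTextGenerator:
    # Role of segmentation module 301: text -> original units.
    segment: Callable[[str], List[str]]
    # Predetermined dialect pinyin codes, keyed by dialect type (simplified
    # here to a lookup table from original unit to target unit).
    code_tables: Dict[str, Dict[str, str]]

    def determine(self, dialect_type: str) -> Dict[str, str]:
        """Role of determining module 302: pick the target dialect codes."""
        return self.code_tables[dialect_type]

    def encode(self, unit: str, table: Dict[str, str]) -> str:
        """Role of encoding module 303: original unit -> target unit."""
        return table.get(unit, unit)  # unknown units pass through unchanged

    def generate(self, text: str, dialect_type: str) -> str:
        table = self.determine(dialect_type)
        indexed = [(i, self.encode(u, table)) for i, u in enumerate(self.segment(text))]
        # Role of sorting module 304: restore the order of the original units.
        indexed.sort(key=lambda pair: pair[0])
        return " ".join(code for _, code in indexed)
```

Carrying the original index with each target unit keeps the ordering step explicit, which matters if units are encoded out of order, for example in parallel.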

In still another alternative embodiment, as shown in fig. 4, the prosodic text generating device applied to the dialect further includes:

A phonetic analysis module 305, configured to perform phonetic analysis on all phonemes in a sample dialect for any of a plurality of sample dialects, and extract basic phonetic features, where the plurality of sample dialects includes a target dialect;

the dialect syllable parsing module 306 is configured to determine a dialect initial consonant string set and a dialect final syllable string set according to the basic phonetic features, where the dialect initial consonant string set includes all dialect initial consonant strings representing the initial pronunciation of all syllables in the sample dialect, and the dialect final syllable string set includes all dialect final strings representing the non-initial pronunciation of all syllables in the sample dialect;

The dialect syllable parsing module 306 is further configured to determine a dialect tone character set according to the basic speech features, where the dialect tone character set includes all dialect tone characters representing all pronunciation tones in the sample dialect;

The dialect syllable coding module 307 is configured to determine, for any syllable of the sample dialect and according to a preset dialect pinyin coding order, the syllable pinyin code corresponding to the syllable from the dialect initial character string, the dialect final character string and the dialect tone character corresponding to the syllable, and to take the syllable pinyin codes corresponding to all syllables of the sample dialect as the dialect pinyin codes corresponding to the sample dialect, where the dialect initial character string is a character string that may be omitted.

The determining module 302 is further configured to, after the dialect syllable coding module 307 determines the dialect pinyin codes corresponding to all the sample dialects, determine those dialect pinyin codes as the plurality of predetermined dialect pinyin codes.

In this embodiment, a phoneme is the smallest pronunciation unit in a language, that is, the smallest unit or smallest speech segment constituting a syllable, and the smallest linear speech unit divided from the perspective of sound quality. A syllable is a sound unit produced continuously by the airflow passing through the vocal organs during speech; it is also the smallest speech unit formed by combining vowel phonemes and consonant phonemes in a language, and a single vowel phoneme may also form a syllable. The basic speech features refer to the acoustic and physiological characteristics formed during the pronunciation of phonemes, specifically including pitch (intonation), duration, and sound quality (including the openness, front-back position and tongue position of vowels, and the voicing, place of articulation and manner of articulation of consonants), which together define the unique pronunciation properties of a syllable. The syllable-initial pronunciation refers to the initial consonant or consonant combination of a syllable; the non-initial pronunciation of a syllable refers to the vowels and subsequent consonants that follow the syllable-initial pronunciation; tone refers to the pitch variation of a syllable or phoneme; and phonetic analysis is the systematic study of the acoustic and articulatory characteristics of speech, including how phonemes are produced and their audio characteristics.

After the dialect syllable coding module 307 determines the dialect pinyin codes corresponding to all the sample dialects, the determining module 302 is triggered to execute the operation of determining the target dialect pinyin code matched with the target dialect from the plurality of dialect pinyin codes determined in advance according to the type of the target dialect.
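The syllable coding performed by module 307 can be illustrated with a minimal sketch. It assumes that the predetermined coding order is initial, then final, then tone, which the text does not state; the function names and the Jyutping-style sample values are hypothetical and serve only as an illustration.

```python
from typing import Iterable, List, Optional, Tuple

# (initial, final, tone); the initial may be None because the dialect
# initial character string is allowed to be omitted.
Syllable = Tuple[Optional[str], str, str]


def encode_syllable(initial: Optional[str], final: str, tone: str) -> str:
    """Join the parts in the ASSUMED order initial + final + tone."""
    if not final or not tone:
        raise ValueError("a syllable needs at least a final and a tone")
    return f"{initial or ''}{final}{tone}"


def build_dialect_pinyin_code(syllables: Iterable[Syllable]) -> List[str]:
    """Collect the syllable codes of a sample dialect, dropping duplicates
    while keeping first-seen order."""
    codes = (encode_syllable(i, f, t) for i, f, t in syllables)
    return list(dict.fromkeys(codes))


# Hypothetical sample data, for illustration only.
sample = [("n", "ei", "5"), (None, "aa", "3"), ("n", "ei", "5")]
print(build_dialect_pinyin_code(sample))
```

The second sample syllable has no initial, so its code consists of the final and the tone only.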

In this alternative embodiment, further optionally, the specific manner in which the dialect syllable parsing module 306 determines the dialect initial string set and the dialect final string set according to the basic speech features is:

In yet another alternative embodiment, the specific manner in which the encoding module 303 encodes any original unit based on the target dialect pinyin code to obtain the target unit of that original unit is:

In this alternative embodiment, single digits refer to the digits 0-9; the pronunciation of a single digit may differ between application scenarios, and different types of digits may also be pronounced differently when combined.
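The scenario dependence of digit readings can be sketched as a lookup keyed by scenario. The scenario names and the function below are hypothetical, and the readings are Standard Mandarin pinyin used purely as a stand-in (where "1" is commonly read "yao1" in digit sequences such as telephone numbers); a real dialect would supply its own tables.

```python
from typing import Dict, List

# Stand-in readings (Standard Mandarin, tone written as a trailing digit).
_QUANTITY: Dict[str, str] = {
    "0": "ling2", "1": "yi1", "2": "er4", "3": "san1", "4": "si4",
    "5": "wu3", "6": "liu4", "7": "qi1", "8": "ba1", "9": "jiu3",
}

# Hypothetical scenario table: identical except for the reading of "1".
DIGIT_READINGS: Dict[str, Dict[str, str]] = {
    "quantity": _QUANTITY,
    "sequence": {**_QUANTITY, "1": "yao1"},
}


def encode_digits(digits: str, scenario: str) -> List[str]:
    """Encode each single digit with the reading table of the scenario."""
    table = DIGIT_READINGS[scenario]
    return [table[d] for d in digits]
```

For example, `encode_digits("110", "sequence")` and `encode_digits("110", "quantity")` differ only in how the digit 1 is encoded.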

The text cleaning module 308 is configured to, before the segmentation module 301 segments the text to be marked of the target dialect according to a preset text segmentation mode, determine a text cleaning manner according to the type of the target dialect, perform a text cleaning operation on the text to be marked according to the text cleaning manner, where the text cleaning operation includes deleting repeated characters and/or deleting predetermined illegal characters, and then trigger the segmentation module 301 to perform the operation of segmenting the text to be marked of the target dialect according to the preset text segmentation mode;

The punctuation mark deleting module 309 is configured to determine a punctuation mark deleting mode according to a type of the target dialect before the segmentation module 301 segments the text to be marked of the target dialect according to a preset text segmentation mode, perform a punctuation mark deleting operation on the text to be marked according to the punctuation mark deleting mode, and trigger the segmentation module 301 to perform a segmentation operation on the text to be marked of the target dialect according to the preset text segmentation mode;

The language conversion module 310 is configured to determine whether other language texts different from the target dialect exist in the text to be marked before the segmentation module 301 segments the text to be marked of the target dialect according to a preset text segmentation mode, and if the other language texts exist, convert the other language texts into texts of the same type as the target dialect, and trigger the segmentation module 301 to execute the operation of segmenting the text to be marked of the target dialect according to the preset text segmentation mode.

In this alternative embodiment, deleting repeated characters and/or illegal characters reduces noise in the prosodic text and improves both the accuracy and the efficiency of generating the dialect prosodic text. Deleting punctuation marks that do not carry a pause duration eliminates the interference caused by irrelevant punctuation marks, improves the accuracy and efficiency of generating the dialect prosodic text, and simplifies the structure of the prosodic text. Converting other language texts into text of the same type as the target dialect ensures that all text in the text to be marked is of the same language type as the target dialect, which improves the accuracy of generating the dialect prosodic text and, in turn, the accuracy of dialect prosody expression during speech synthesis.
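The cleaning and punctuation-deletion steps described above might look like the following minimal sketch. The character sets are hypothetical placeholders, since the text leaves them to be chosen per dialect type, and "deleting repeated characters" is interpreted here as collapsing runs of the same symbol, which is an assumption. Language conversion is not shown.

```python
import re

# Hypothetical, dialect-specific configuration.
ILLEGAL_CHARS = set("#*@")            # characters treated as illegal
NON_PAUSE_PUNCT = set("()\"'《》")     # marks assumed to carry no pause


def remove_illegal(text: str) -> str:
    """Drop every predetermined illegal character."""
    return "".join(ch for ch in text if ch not in ILLEGAL_CHARS)


def collapse_repeats(text: str) -> str:
    """Collapse runs of the same non-word, non-space symbol (e.g. '!!!').
    Word characters are left alone so reduplicated words survive."""
    return re.sub(r"([^\w\s])\1+", r"\1", text)


def delete_punctuation(text: str) -> str:
    """Delete punctuation marks that do not carry a pause duration."""
    return "".join(ch for ch in text if ch not in NON_PAUSE_PUNCT)


def preprocess(text: str) -> str:
    """Run the steps in order before the text is segmented."""
    return delete_punctuation(collapse_repeats(remove_illegal(text)))
```

Restricting the collapse step to symbols is a design choice of this sketch: collapsing every repeated character would also destroy legitimate reduplicated words.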

The extension sound encoding module 311 is configured to, after the encoding module 303 encodes the original unit based on the target dialect pinyin code to obtain the target unit of the original unit, add a corresponding extension sound code to each target unit that is predetermined to be lengthened, where the extension sound code corresponding to a target unit indicates that the sound of the non-beginning part of the syllable of the target dialect corresponding to that target unit needs to be extended in the speech stream.

After the extension sound encoding module 311 has added the extension sound codes to all such target units, the sorting module 304 is triggered to perform the operation of sorting all the target units according to the order in the text to be marked of the original unit corresponding to each target unit, so as to obtain the dialect prosodic text corresponding to the text to be marked.

The recoding module 312 is configured to, after the encoding module 303 encodes the original unit based on the target dialect pinyin code to obtain the target unit of the original unit, add a corresponding recoding to each target unit that is predetermined to be recoded, where the recoding corresponding to a target unit indicates that the volume of the syllable of the target dialect corresponding to that target unit needs to be increased or decreased in the speech stream.

After the recoding module 312 has added the recoding to all such target units, the sorting module 304 is triggered to perform the operation of sorting all the target units according to the order in the text to be marked of the original unit corresponding to each target unit, so as to obtain the dialect prosodic text corresponding to the text to be marked.

The replacing module 313 is configured to, before the segmentation module 301 segments the text to be marked of the target dialect according to a preset text segmentation mode, replace predetermined text that needs dialect synonym replacement with dialect text having the same meaning, and trigger the segmentation module 301 to perform the operation of segmenting the text to be marked of the target dialect according to the preset text segmentation mode.
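The two annotation steps above, followed by the sorting step, can be sketched as follows. The marker symbols, the `TargetUnit` structure and the space-separated output are assumptions made for illustration; the text does not specify how the extension sound code or the recoding is written.

```python
from dataclasses import dataclass
from typing import Iterable

LENGTHEN_MARK = ":"                   # hypothetical extension sound code
VOLUME_MARKS = {1: "+", -1: "-"}      # hypothetical louder / softer codes


@dataclass
class TargetUnit:
    index: int              # position of the original unit in the text
    code: str               # dialect pinyin code of the unit
    lengthen: bool = False  # True if the syllable rhyme is to be extended
    volume: int = 0         # +1 louder, -1 softer, 0 unchanged


def annotate(unit: TargetUnit) -> str:
    """Append the optional prosody markers to the unit's pinyin code."""
    out = unit.code
    if unit.lengthen:
        out += LENGTHEN_MARK
    return out + VOLUME_MARKS.get(unit.volume, "")


def to_prosodic_text(units: Iterable[TargetUnit]) -> str:
    """Sort by original position, then join the annotated units."""
    ordered = sorted(units, key=lambda u: u.index)
    return " ".join(annotate(u) for u in ordered)
```

Because sorting uses the stored original index, the annotation steps may process target units in any order without affecting the final prosodic text.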

In this embodiment, a dialect synonym refers to the expression commonly used in a dialect in place of a given word of a language.

In still another alternative embodiment, as shown in fig. 4, the original unit further includes punctuation characters, and the prosodic text generating device applied to the dialect further includes:

The second encoding module 314 is configured to, before the sorting module 304 sorts all the target units according to the order in the text to be marked of the original unit corresponding to each target unit to obtain the dialect prosodic text corresponding to the text to be marked, determine, according to the type of the target dialect, a punctuation code corresponding to each punctuation character, where the punctuation code represents the pause duration in the speech stream at the position corresponding to that of the punctuation character in the text to be marked; encode the punctuation characters according to the punctuation codes to obtain target units of the punctuation characters; and trigger the sorting module 304 to perform the operation of sorting all the target units according to the order in the text to be marked of the original unit corresponding to each target unit, so as to obtain the dialect prosodic text corresponding to the text to be marked.

Wherein punctuation coding is determined by:

defining a unique voice pause mode code symbol for each voice pause mode;
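A minimal sketch of the punctuation coding described above, assuming three pause modes and hypothetical code symbols; the actual pause modes, symbols and punctuation assignments depend on the target dialect and are not given in the text.

```python
from typing import Dict, List

# Hypothetical pause modes, each with its own unique code symbol.
PAUSE_CODES: Dict[str, str] = {"short": "#1", "medium": "#2", "long": "#3"}
assert len(set(PAUSE_CODES.values())) == len(PAUSE_CODES)  # symbols unique

# Hypothetical assignment of punctuation characters to pause modes.
PUNCT_TO_PAUSE: Dict[str, str] = {
    "，": "short", ",": "short",
    "；": "medium", ";": "medium",
    "。": "long", ".": "long", "！": "long", "？": "long",
}


def encode_punctuation(ch: str) -> str:
    """Return the pause code of a punctuation character."""
    return PAUSE_CODES[PUNCT_TO_PAUSE[ch]]


def encode_sequence(units: List[str], pinyin_lookup: Dict[str, str]) -> str:
    """Encode original units in order: punctuation becomes a pause code,
    every other unit is looked up in the dialect pinyin table."""
    out = []
    for unit in units:
        if unit in PUNCT_TO_PAUSE:
            out.append(encode_punctuation(unit))
        else:
            out.append(pinyin_lookup[unit])
    return " ".join(out)
```

In this sketch the pause code simply takes the punctuation character's place in the sequence, so the pause falls at the corresponding position in the speech stream.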

The speech synthesis module 315 is configured to, after the sorting module 304 sorts all the target units according to the order in the text to be marked of the original unit corresponding to each target unit to obtain the dialect prosodic text corresponding to the text to be marked, convert the dialect prosodic text into a playable dialect speech file by using a predetermined dialect speech synthesis model;

Therefore, this alternative embodiment converts the dialect prosodic text, which conforms to the pronunciation characteristics of the dialect, into a playable dialect speech file and can capture the accurate prosodic characteristics of the dialect, which improves the accuracy of generating the dialect prosodic text and, in turn, the synthesis of speech that accurately expresses the prosodic characteristics of the dialect. In addition, this alternative embodiment takes the speech synthesis model corresponding to an existing standard language as a basis and adapts it to the target dialect through transfer learning, which reduces the training time of the speech synthesis model of the target dialect and improves the efficiency of dialect speech synthesis. Furthermore, transfer learning allows the model to inherit the generalization capability of the speech synthesis model corresponding to the standard language, including its understanding of the basic characteristics of speech, so that the target dialect speech synthesis model to be trained adapts to new dialect data more quickly, further improving the efficiency of dialect speech synthesis.

Example four

Referring to fig. 5, fig. 5 is a schematic structural diagram of another prosodic text generating device applied to dialects according to an embodiment of the present invention. The apparatus described in fig. 5 may be applied to an application server. As shown in fig. 5, the apparatus may include:

A memory 501 in which executable program codes are stored;

A processor 502 coupled to the memory 501;

Further, an input interface 503 and an output interface 504 coupled to the processor 502 may also be included;

The processor 502 invokes executable program code stored in the memory 501 to perform the steps of the method described in any of the first to second embodiments of the present invention.

Example five

The embodiment of the invention discloses a computer storage medium which stores computer instructions for executing steps in the method described in any one of the first to second embodiments of the invention when the computer instructions are called.

Example six

Embodiments of the present invention disclose a computer program product comprising a non-transitory computer readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the steps of the method described in any of the first to second embodiments of the present invention.

The apparatus embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.

From the above detailed description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware. Based on such understanding, the foregoing technical solutions may be embodied, in essence or in the part contributing to the prior art, in the form of a software product that may be stored in a computer-readable storage medium, including read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.

Finally, it should be noted that the prosodic text generating method and device applied to dialects disclosed in the embodiments of the present invention are merely preferred embodiments of the invention and are used only to illustrate, not to limit, its technical solutions. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions recorded in the various embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims

1. A prosodic text generating method applied to a dialect, the method comprising:

2. The prosodic text generating method applied to a dialect of claim 1, further comprising:

3. The prosodic text generating method applied to a dialect according to claim 2, wherein the determining a set of dialect initial strings and a set of dialect final strings from the basic speech features comprises:

4. The prosodic text generating method applied to a dialect according to any one of claims 1 to 3, wherein the encoding, for any of the original units, the original unit based on the target dialect pinyin code to obtain a target unit of the original unit comprises:

5. The prosodic text generating method applied to a dialect according to any one of claims 1 to 3, wherein the original units further comprise punctuation characters, and before the sorting all the target units according to the order in the text to be marked of the original unit corresponding to each target unit to obtain the dialect prosodic text corresponding to the text to be marked, the method further comprises:

Wherein the punctuation code is determined by:

6. The prosodic text generating method applied to a dialect according to any one of claims 1 to 3, wherein before the text to be marked of the target dialect is segmented according to a preset text segmentation mode, the method further comprises:

7. The prosodic text generating method applied to a dialect according to any one of claims 1 to 3, wherein after the sorting all the target units according to the order in the text to be marked of the original unit corresponding to each target unit to obtain the dialect prosodic text corresponding to the text to be marked, the method further comprises:

8. A prosodic text generating device for use in a dialect, the device comprising:

9. A prosodic text generating device for use in a dialect, the device comprising:

A memory storing executable program code;

a processor coupled to the memory;

The processor invokes the executable program code stored in the memory to perform the prosodic text generating method applied to dialects as defined in any one of claims 1-7.

10. A computer-readable storage medium storing computer instructions which, when called, are used to perform the prosodic text generating method applied to a dialect according to any one of claims 1 to 7.