CN115116428A - Prosodic boundary labeling method, apparatus, device, medium, and program product - Google Patents

Prosodic boundary labeling method, apparatus, device, medium, and program product

Info

Publication number
CN115116428A
CN115116428A (application CN202210555616.XA)
Authority
CN
China
Prior art keywords: text, target, boundary, audio, prosody
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210555616.XA
Other languages
Chinese (zh)
Other versions
CN115116428B (en)
Inventor
余剑威
王琰
戴子茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210555616.XA
Publication of CN115116428A
Application granted
Publication of CN115116428B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10: Prosody rules derived from text; Stress or intonation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application discloses a prosodic boundary labeling method, apparatus, device, medium, and program product, relating to the field of machine learning. The method comprises the following steps: acquiring a target text and a target audio, and extracting a text prosody feature representation of the target text using characters as the analysis granularity; extracting an audio prosody feature representation of the target audio based on analysis of the sound production content; fusing the text prosody feature representation and the audio prosody feature representation to obtain a fused prosody feature representation; and performing prosodic boundary prediction on the target text based on the fused prosody feature representation to obtain a prosodic boundary labeling result of the same length as the target text. Because prosodic boundary prediction is performed on the fused prosody feature representation, the accuracy of prosodic boundary labeling of the target text is improved; and because the target text is analyzed and its prosodic boundaries are predicted at character granularity, the fineness of the prosodic boundary labeling result is improved, further improving the accuracy of prosodic boundary labeling of the target text.

Description

Prosodic boundary labeling method, apparatus, device, medium, and program product
Technical Field
Embodiments of the present disclosure relate to the field of machine learning, and in particular, to a method, an apparatus, a device, a medium, and a program product for prosodic boundary labeling.
Background
A Text-To-Speech (TTS) system is a computer system that can convert arbitrary input text into corresponding speech. In a speech synthesis system, the prosodic boundaries of the input text need to be predicted: accurate prosodic boundaries make the synthesized speech of the input text closer to a human voice and more natural and accurate in expression. Therefore, training data with accurately labeled prosodic boundaries is critical to building a high-quality speech synthesis system.
In the related art, prosodic boundary labeling methods usually extract relevant feature information from the text content and perform feature analysis on it, so as to predict the prosodic boundaries of the text content.
However, the prosodic boundary labeling methods in the related art described above have low accuracy.
Disclosure of Invention
Embodiments of the present application provide a method, an apparatus, a device, a medium, and a program product for prosodic boundary labeling, which can improve the accuracy of prosodic boundary labeling. The technical scheme is as follows:
in one aspect, a prosodic boundary labeling method is provided, where the method includes:
acquiring a target text and a target audio, wherein the text content of the target text is matched with the audio content of the target audio, and the target text is a text to be subjected to prosodic boundary identification;
extracting text prosody feature representation of the target text by taking characters as analysis granularity; extracting audio prosody feature representation of the target audio on the basis of analysis of the sound production content;
fusing the text prosody feature representation and the audio prosody feature representation to obtain a fused prosody feature representation;
performing prosody boundary prediction on the target text based on the fused prosody feature representation to obtain a prosody boundary labeling result with the same length as the target text, wherein the prosody boundary labeling result comprises a prosody boundary which is divided on the target text by taking characters as granularity.
In another aspect, a prosodic boundary labeling apparatus is provided, the apparatus including:
the data acquisition module is used for acquiring a target text and a target audio, wherein the text content of the target text is matched with the audio content of the target audio, and the target text is a text to be subjected to prosodic boundary identification;
the feature extraction module is used for extracting text prosody feature representation of the target text by taking characters as analysis granularity; and extracting an audio prosody feature representation of the target audio on the basis of the analysis of the content of the utterance;
the feature fusion module is used for fusing the text prosody feature representation and the audio prosody feature representation to obtain a fused prosody feature representation;
and the feature analysis module is used for carrying out prosody boundary prediction on the target text based on the fused prosody feature representation to obtain a prosody boundary labeling result with the same length as the target text, wherein the prosody boundary labeling result comprises a prosody boundary which is divided on the target text by taking characters as granularity.
In another aspect, a computer device is provided, which includes a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the prosodic boundary labeling method according to any one of the embodiments of the present application.
In another aspect, a computer-readable storage medium is provided, in which at least one program code is stored, and the at least one program code is loaded and executed by a processor to implement the prosodic boundary labeling method described in any of the embodiments of the present application.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the prosodic boundary labeling method described in any of the embodiments of the present application.
The technical scheme provided by the embodiment of the application at least comprises the following beneficial effects:
the text prosody feature representation of the target text and the audio prosody feature representation of the target audio are fused to obtain a fused prosody feature representation, and prosodic boundary prediction is performed on the fused prosody feature representation. Because the audio prosody feature representation contains prosodic boundary information, the accuracy of prosodic boundary labeling of the target text is improved. In addition, the target text is analyzed and its prosodic boundaries are predicted at character granularity, which improves the fineness of the prosodic boundary labeling result and further improves the accuracy of prosodic boundary labeling of the target text.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can also be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of a prosodic boundary labeling method provided in an exemplary embodiment of the present application;
FIG. 2 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 3 is a flowchart of a prosodic boundary labeling method provided in an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of prosodic boundary text annotation provided in an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of prosody scale hierarchy provided by an exemplary embodiment of the present application;
FIG. 6 is a flowchart of a prosodic boundary labeling method provided in another exemplary embodiment of the present application;
FIG. 7 is a schematic illustration of a multimodal fusion model provided by an exemplary embodiment of the present application;
FIG. 8 is a flowchart of a prosodic boundary labeling method provided in another exemplary embodiment of the present application;
FIG. 9 is a schematic diagram of a prosodic boundary annotation model provided in an exemplary embodiment of the present application;
FIG. 10 is an automatic indicator profile provided by an exemplary embodiment of the present application;
FIG. 11 is a schematic diagram of a consistency check coefficient matrix provided by an exemplary embodiment of the present application;
FIG. 12 is a schematic diagram illustrating a process for obtaining training data according to an exemplary embodiment of the present application;
FIG. 13 is profile data provided by an exemplary embodiment of the present application;
FIG. 14 is a block diagram illustrating a prosodic boundary labeling apparatus according to an exemplary embodiment of the present application;
FIG. 15 is a block diagram of a prosodic boundary labeling apparatus according to another exemplary embodiment of the present application;
fig. 16 is a block diagram of a server according to an exemplary embodiment of the present application.
Detailed Description
To make the purpose, technical solutions and advantages of the present application clearer, the following will describe embodiments of the present application in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like, in this application, are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it is to be understood that "first" and "second" do not have a logical or temporal dependency, nor do they define a quantity or order of execution.
First, a brief description is given of terms referred to in the embodiments of the present application:
speech Technology (Speech Technology): the key technologies of the Speech technology are Automatic Speech Recognition (ASR) technology, Speech synthesis technology, and voiceprint Recognition technology. The computer can listen, see, speak and feel, and the development direction of future human-computer interaction is provided, wherein voice becomes one of the good human-computer interaction modes in the future.
Prosodic boundaries: prosodic boundaries are used to divide text into prosodic levels, and their positions affect the naturalness and the meaning of the text when it is spoken. Different prosodic levels represent different prosody, where prosody refers to the rhythm and regularity of the sounds in audio. Optionally, the prosody is used to indicate the pitch, intensity, and duration of the sound corresponding to each character in a piece of audio, as well as the pause time between characters. The prosodic levels include: Character (CC), grammatical word (Lexicon Word, LW), Prosodic Word (PW), Prosodic Phrase (PPH), and Intonation Phrase (IPH).
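As a minimal illustration, the label set above can be written as a small Python enumeration; the enum name and the integer ordering are assumptions made for this sketch, not part of the application:

```python
from enum import IntEnum

class ProsodyLevel(IntEnum):
    """Prosodic levels from finest to coarsest, as listed above."""
    CC = 0    # character
    LW = 1    # grammatical (lexicon) word
    PW = 2    # prosodic word
    PPH = 3   # prosodic phrase
    IPH = 4   # intonation phrase
```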
Phoneme: a phoneme is the smallest unit of speech, and one articulatory action forms one phoneme. For example, the Mandarin word "我" ("I") contains two phonemes, "w" and "o". Phonemes include vowel phonemes and consonant phonemes.
In the related art, prosodic boundary labeling methods mainly include the following:
(1) Manually labeling prosodic boundaries. This approach is time-consuming and costly, and different annotators apply inconsistent criteria for prosodic words and prosodic phrases, so the same batch of data labeled by different annotators cannot be used together directly.
(2) Automatically labeling prosodic boundaries with a prosodic boundary labeling model. In the related art, a prosodic boundary labeling model usually extracts relevant feature information from the text content and then performs feature analysis on it to predict the prosodic boundaries of the text content, and the prosodic boundary labeling accuracy is low.
An embodiment of the present application provides a prosodic boundary labeling method. Referring schematically to fig. 1, a target text 101 and a target audio 102 matching the target text 101 are obtained, and the target text 101 and the target audio 102 are then analyzed so as to perform fine-grained prosodic boundary labeling prediction on the target text 101, which specifically includes the following steps:
Illustratively, a text prosody feature vector 104 of the target text 101 is extracted by a text encoder 103; optionally, the text prosody feature vector 104 is a word vector containing the contextual features of the target text 101. An audio prosody feature vector 106 of the target audio 102 is extracted by an audio encoder 105; optionally, the audio prosody feature vector 106 is a vector containing prosody-related information of the target audio 102 (such as pitch, intensity, etc.). The text prosody feature vector 104 and the audio prosody feature vector 106 are input into a multimodal fusion decoder 107, where they are fused to obtain a fused prosody feature vector; the fused prosody feature vector is analyzed and predicted to obtain a prosodic boundary labeling sequence corresponding to each character in the target text, and the target text is aligned with the prosodic boundary labeling sequence to obtain text data 108 with accurately labeled prosodic boundaries.
The prosodic boundary labeling method provided by the embodiment of the present application has performance comparable to manual labeling, saves labeling time and cost, and distinguishes prosodic boundaries of different granularities with a unified standard, thereby achieving higher labeling consistency. The method extracts prosodic-boundary-related information from the text and the audio respectively, fuses the two through the multimodal decoder, and then decodes, thereby obtaining the prosodic boundary labeling result.
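As a minimal sketch of this flow (not the actual implementation of this application), the stages of fig. 1 can be outlined as follows; the stub functions extract_text_features, extract_audio_features, and fuse_and_predict, and all dimensions, are hypothetical stand-ins for the text encoder 103, the audio encoder 105, and the multimodal fusion decoder 107:

```python
import numpy as np

def extract_text_features(chars):                 # hypothetical stand-in for text encoder 103
    return np.zeros((len(chars), 768))            # one context-aware vector per character

def extract_audio_features(audio_path):           # hypothetical stand-in for audio encoder 105
    return np.zeros((200, 80))                    # one prosody-related vector per audio frame

def fuse_and_predict(text_feats, audio_feats):    # hypothetical stand-in for multimodal fusion decoder 107
    return ["CC"] * len(text_feats)               # one boundary label per character

def label_prosodic_boundaries(target_text, target_audio_path):
    chars = list(target_text)                     # character-granularity analysis
    labels = fuse_and_predict(extract_text_features(chars),
                              extract_audio_features(target_audio_path))
    return list(zip(chars, labels))               # labeling result has the same length as the text

print(label_prosodic_boundaries("我们提出用自动标注器标注韵律", "audio.wav"))
```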
The prosodic boundary labeling method provided by the embodiment of the application can be at least applied to the following application scenarios:
1. The method is applied to a speech synthesis system. Illustratively, if the speech synthesis system is implemented as a Mandarin speech synthesis system, a Chinese text and the matching Mandarin audio are obtained; a text prosody feature representation of the Chinese text is extracted using each Chinese character in the Chinese text as the analysis granularity; an audio prosody feature representation of the Mandarin audio is extracted; the text prosody feature representation and the audio prosody feature representation are fused to obtain a fused prosody feature representation; and the fused feature representation is analyzed and predicted to obtain a prosodic boundary labeling sequence corresponding to each Chinese character in the Chinese text, and the Chinese text is aligned and connected with the prosodic boundary labeling sequence to obtain text data with accurately labeled prosodic boundaries. The text data is input into the speech synthesis system, which outputs synthesized audio; the synthesized audio is generated based on the prosodic labels in the text data, so its prosodic naturalness is improved. Optionally, the synthesized audio may be synthesized audio of any timbre that matches the text data.
2. The method is applied to automatic feedback in reading/singing practice. Schematically, pronunciation audio of a reading/singing practitioner is obtained; the pronunciation audio is first converted into an unlabeled practice text by a speech recognition system, and a text prosody feature representation of the unlabeled practice text is extracted using each character in it as the analysis granularity; an audio prosody feature representation of the pronunciation audio is extracted; the text prosody feature representation and the audio prosody feature representation are fused to obtain a fused prosody feature representation; through analysis and prediction on the fused feature representation, a prosodic boundary labeling sequence corresponding to each character in the unlabeled practice text is obtained, and the unlabeled practice text is aligned and connected with the prosodic boundary labeling sequence to obtain a prosodically labeled practice text. The practice text is compared with a standard text stored in a practice library to obtain a comparison result, and the comparison result is sent to the reading/singing practitioner, so that the practitioner can improve pronunciation and reading/singing level according to the comparison result.
It should be noted that the above application scenarios are only illustrative examples, and other application scenarios of the prosodic boundary labeling method are not limited in the embodiment of the present application.
The prosodic boundary labeling method provided by the embodiment of the present application may be implemented by a terminal or a server alone, or by the terminal and the server together; the following description takes implementation by the terminal and the server together as an example. Fig. 2 is a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application. As shown in fig. 2, the implementation environment includes a terminal 210, a server 220, and a communication network 230, where the terminal 210 and the server 220 are connected through the communication network 230.
In some alternative embodiments, the terminal 210 has a target application with a prosodic boundary labeling function installed and running therein. The target application program may be implemented as a speech synthesis application program, a speech recognition application program, a spoken language practice application program, a vehicle-mounted speech navigation application program, and the like, which is not limited in this embodiment of the present application. Illustratively, when prosodic boundary labeling needs to be performed on the target text, the target text and the target audio corresponding to the target text are input into the terminal 210 by the object, and the terminal 210 sends the target text and the target audio input by the object to the server 220.
In some optional embodiments, the server 220 is configured to provide a prosodic border marking service for a target application installed in the terminal 210, and the text encoder, the audio encoder and the multi-modal fusion decoder are disposed in the server 220. Illustratively, after receiving the target text and the target audio sent by the terminal 210, the server 220 inputs the target text into a text encoder to extract text prosody feature representation of the target text; inputting the target audio into an audio encoder to extract the audio prosody feature expression of the target audio; inputting the text prosody feature representation and the audio prosody feature representation into a multi-mode fusion decoder, so that the text prosody feature representation and the audio prosody feature representation are fused to obtain a fusion prosody feature representation, and performing prosody boundary prediction on a target text based on the fusion prosody feature representation to obtain a prosody boundary labeling result with the same length as the target text; the final server 220 feeds back the prosody boundary labeling result to the terminal 210, and optionally, the terminal 210 displays the prosody boundary labeling result.
In some optional embodiments, at least one of the audio encoder, the text encoder, and the multi-modal fusion decoder may also be disposed in the terminal 210, and the terminal 210 implements part or all of the prosodic boundary labeling process, which is not limited in this embodiment.
In some optional embodiments, the target application installed in the terminal 210 is further provided with a voice conversion function, where, for example, the object inputs the target text into the terminal 210, and the terminal 210 converts the target text input by the object into the target audio corresponding to the target text.
In some optional embodiments, the target application installed in the terminal 210 is further provided with a voice recognition function, where, for example, the object inputs the target audio into the terminal 210, and the terminal 210 converts the target audio input by the object into a target text corresponding to the target audio, where the target text is a text without prosodic boundary labels.
The terminal 210 includes at least one of a smart phone, a tablet computer, a portable laptop computer, a desktop computer, a smart speaker, a smart wearable device, a smart voice interaction device, a smart appliance, and a vehicle-mounted terminal.
It should be noted that the server 220 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, and the like.
Alternatively, server 220 may also be implemented as a node in a blockchain system.
It should be noted that the communication network 230 may be implemented as a wired network or a wireless network, and the communication network 230 may be implemented as any one of a local area network, a metropolitan area network, or a wide area network, which is not limited in this embodiment of the present invention.
It should be noted that the prosodic boundary marking service implemented by the server 220 may also be implemented in the terminal 210, which is not limited in this embodiment.
It should be noted that information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are authorized by the user or sufficiently authorized by various parties, and the collection, use, and processing of the relevant data is required to comply with relevant laws and regulations and standards in relevant countries and regions. For example, the target text and target audio referred to in this application are both obtained with sufficient authorization.
With reference to the above description and implementation environments, a prosodic boundary labeling method provided in an embodiment of the present application is described, and fig. 3 is a flowchart of a prosodic boundary labeling method provided in an exemplary embodiment of the present application, which is described by taking as an example that the method is applied to the server 220 shown in fig. 2, and the method includes:
step 301, acquiring a target text and a target audio.
And matching the text content of the target text with the audio content of the target audio, wherein the target text is a text to be subjected to prosodic boundary identification.
Optionally, the target text is at least one of a Chinese text, an English text, and the like; the embodiment of the present application does not limit the language of the target text. Optionally, the text content of the target text may include characters and punctuation marks, or may include only characters; for example, the target text may be "你好！欢迎你！" ("Hello! Welcome you!") with punctuation, or "你好欢迎你" with characters only.
Optionally, the target audio is one of mandarin chinese audio, other chinese dialect audio, english pronunciation audio, english american pronunciation audio, and the like, and the embodiment of the present application does not limit the type of the language of the target audio.
Optionally, the text content of the target text matches the audio content of the target audio. Illustratively, the target text is implemented as a Chinese text and the target audio is implemented as Mandarin audio: if the text content of the target text is "你好！欢迎你！" ("Hello! Welcome you!"), the audio content of the target audio is "你好！欢迎你！" read in Mandarin. It should be noted that a target text in one language may have multiple matching target audios in different languages or dialects; illustratively, if the text content of the target text is "你好！欢迎你！", the audio content of the target audio may be "你好！欢迎你！" read in Mandarin, or "你好！欢迎你！" read in the Cantonese dialect. Different target audios can be selected according to the language of the speech synthesis system to be trained; illustratively, if data for training a Mandarin speech synthesis system needs to be acquired, the target audio corresponding to the Chinese text is implemented as Mandarin audio.
Optionally, the target text includes at least one continuous character segment requiring prosodic boundary recognition.
Optionally, prosodic boundary recognition mainly judges the type of the prosodic boundary between characters based on the semantic information corresponding to characters or continuous character combinations, the pause time in pronunciation between characters, the pronunciation duration of a single character, the pitch level of a single character's pronunciation, the pronunciation changes of continuous character combinations, and the like.
The prosodic boundaries include at least one of word boundaries, grammatical word boundaries, prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries:
1. word boundaries are boundaries that divide characters in the target text.
Optionally, the word boundary is a boundary between adjacent characters in the target text.
Illustratively, the characters in a Chinese text are the individual Chinese characters in the text, and a word boundary in a Chinese text is a boundary between two adjacent Chinese characters.
Optionally, each character in the target text corresponds to a word boundary identifier, where the word boundary identifier is used to indicate that the boundary between the corresponding character and the next character is a word boundary. Illustratively, in "我们" ("we"), "我" corresponds to a word boundary identifier, which means that the boundary between "我" and "们" is a word boundary.
Optionally, the labeling method for word boundary identification includes at least one of the following methods:
In the first method, the word boundary identifier is used as a basic identifier, and the grammatical word boundary identifier, prosodic word boundary identifier, prosodic phrase boundary identifier, and intonation phrase boundary identifier can modify the word boundary identifier.
Schematically, taking the target text as a Chinese text as an example: when prosodic boundary labeling is performed on the Chinese text, a character boundary identifier is first labeled on each Chinese character in the Chinese text, and the prosodic boundary recognition process then continues; if the boundary between adjacent Chinese characters A and B is recognized as a grammatical word boundary, the character boundary identifier of Chinese character A is modified into a grammatical word boundary identifier.
In the second method, grammatical word boundary identifiers, prosodic word boundary identifiers, prosodic phrase boundary identifiers, and intonation phrase boundary identifiers are first labeled on the target text, and character boundary identifiers are then automatically assigned to the remaining characters.
Illustratively, taking the target text as a Chinese text as an example: the grammatical word boundaries, prosodic phrase boundaries, and intonation phrase boundaries of the Chinese text are recognized; the corresponding grammatical word boundary identifiers, prosodic phrase boundary identifiers, and intonation phrase boundary identifiers are labeled on the Chinese characters according to the recognized boundaries; and finally character boundary identifiers are labeled on the Chinese characters that have not yet been labeled.
For example: in the Chinese text "我们提出" ("we propose"), if the boundary between "们" and "提" is recognized as a grammatical word boundary, "们" is labeled with a grammatical word boundary identifier; and if the boundary between "出" and the next character is a prosodic word boundary, "出" is labeled with a prosodic word boundary identifier. "我" and "提", which have not yet been labeled, are then labeled with character boundary identifiers.
2. The grammar word boundaries are boundaries that divide grammar words in the target text.
Optionally, a grammatical word is a word with independent semantics composed of one or more characters in the target text. Illustratively, taking the target text as a Chinese text as an example, grammatical words are the basic word units of the Chinese text and are used to determine the pronunciation of each character within a word and to distinguish polyphonic characters.
Optionally, if the boundary between the adjacent C character and D character is recognized as a grammar word boundary, a grammar word boundary identifier is labeled to the C character. Optionally, the method for labeling the C character with the grammar word boundary identification includes at least one of the following methods:
in the first method, if the C character is not marked with the character boundary identification, the C character is directly marked with the grammar word boundary identification.
Illustratively, the boundary between "people" and "mention" in the Chinese text "we propose" is a grammatical word boundary, and grammatical word boundary identification is directly marked on "people".
And secondly, if the character boundary identifier is marked on the C character, modifying the character boundary identifier into a grammar word boundary identifier.
Illustratively, the boundary between the "middle" and the "middle" in the Chinese text "we propose" is the boundary of the grammatical word, and the "middle" is marked with the word boundary identifier, so that the word boundary identifier of the "middle" is modified into the boundary identifier of the grammatical word.
3. The prosodic word boundary is a boundary that divides prosodic words in the target text.
Alternatively, the prosodic word is a word with no pause in pronunciation composed of one or more grammatical words, that is, the prosodic word is a word composed of one or more grammatical words with continuous pronunciation.
Optionally, if the boundary between the adjacent E character and F character is recognized as a prosodic word boundary, labeling a prosodic word boundary identifier for the E character. Optionally, the method for labeling the prosodic word boundary identification on the E character comprises at least one of the following methods:
In the first method, if the E character is not labeled with a character boundary identifier, the prosodic word boundary identifier is directly labeled on the E character.
Illustratively, in the Chinese text "我们提出用自动标注器标注韵律", the boundary between "出" and "用" is a prosodic word boundary, so the prosodic word boundary identifier is directly labeled on "出".
And secondly, if the character boundary identifier is marked on the E character, modifying the character boundary identifier into a prosodic word boundary identifier.
Illustratively, in the Chinese text "我们提出用自动标注器标注韵律", the boundary between "出" and "用" is a prosodic word boundary, and "出" is already labeled with a character boundary identifier, so the character boundary identifier of "出" is modified into a prosodic word boundary identifier.
4. The prosodic phrase boundaries are boundaries that divide the prosodic phrase in the target text.
Optionally, the prosodic phrase is a phrase consisting of one or more prosodic words without a complete grammatical structure.
Optionally, if the boundary between the adjacent G character and H character is recognized as a prosodic phrase boundary, labeling a prosodic phrase boundary identifier for the G character. Optionally, the method for labeling the prosodic phrase boundary identification for the G character includes at least one of the following methods:
in the first method, if the G character is not marked with a character boundary identifier, a prosodic phrase boundary identifier is directly marked on the G character.
Illustratively, the Chinese text "we propose to label the boundary between" middle "device" and "note" of prosody with an automatic labeling device as a prosodic phrase boundary, and label the prosodic phrase boundary identification directly for the "device".
And secondly, if the G character is marked with a character boundary identifier, modifying the character boundary identifier into a prosodic phrase boundary identifier.
Illustratively, in the Chinese text "我们提出用自动标注器标注韵律", the boundary between "器" and "标" is a prosodic phrase boundary, and "器" is already labeled with a character boundary identifier, so the character boundary identifier of "器" is modified into a prosodic phrase boundary identifier.
5. The intonation phrase boundaries are boundaries that divide the intonation phrases in the target text.
Optionally, the intonation phrase is a phrase with a complete grammar structure composed of one or more prosodic phrases. Illustratively, in a chinese text, a intonation phrase is a pronunciation that can be audibly separated into sentences, and generally corresponds to a syntactic sentence.
Optionally, if the boundary between the adjacent I character and J character is recognized as an intonation phrase boundary, an intonation phrase boundary identifier is labeled on the I character. Optionally, the process of labeling the intonation phrase boundary identifier includes at least one of the following methods:
In the first method, if the I character is not labeled with a character boundary identifier, the intonation phrase boundary identifier is directly labeled on the I character.
Illustratively, in the Chinese text "我们提出用自动标注器标注韵律", the boundary between "律" and the next character is an intonation phrase boundary, so the intonation phrase boundary identifier is directly labeled on "律".
In the second method, if the I character is already labeled with a character boundary identifier, the character boundary identifier is modified into an intonation phrase boundary identifier.
Illustratively, in the Chinese text "我们提出用自动标注器标注韵律", the boundary between "律" and the next character is an intonation phrase boundary, and "律" is already labeled with a character boundary identifier, so the character boundary identifier of "律" is modified into an intonation phrase boundary identifier.
Optionally, the manner of acquiring the target text and the target audio includes at least one of the following manners:
1. acquiring a target text; and acquiring target audio matched with the text content of the target text based on the target text.
Illustratively, the server acquires a target text needing prosodic boundary labeling, and acquires audio of the target text manually read recorded by the terminal as the target audio.
2. Acquiring a target audio; and acquiring a target text matched with the audio content of the target audio based on the target audio.
Schematically, after acquiring target audio needing prosodic boundary labeling, a server converts the target audio into a target text without prosodic labeling through a voice recognition system; or the server acquires a target audio needing prosodic boundary labeling, and acquires text content of the manually recognized target audio received by the terminal as a target text.
It should be noted that the above manner of obtaining the target text and the target audio is only an illustrative example, and the embodiment of the present application does not limit this.
Step 302, extracting text prosody feature representation of the target text by using characters as analysis granularity.
Optionally, the text prosodic feature is represented as a feature representation containing context information of the target text.
In some optional embodiments, wherein the context information comprises: the present embodiment does not limit any of the context semantic information, the context position information, the context character length information, and the like.
The characters can be each Chinese character in a Chinese text or each word in an English text. Illustratively, taking the target text as a Chinese text, each Chinese character in the Chinese text is analyzed, the contextual features contained in each Chinese character are extracted, and the feature representation containing the contextual features is used as the text prosody feature representation; taking the target text as an English text, each word in the English text is analyzed, the contextual features contained in each word are extracted, and the feature representation containing the contextual features is used as the text prosody feature representation.
Optionally, the method further includes preprocessing the target text before extracting the text prosody feature representation of the target text, wherein the method of preprocessing includes at least one of the following methods:
1. and (5) processing redundant information.
Illustratively, the target text may contain some redundant information such as unnecessary spaces, repeated punctuation marks, unnecessary repeated words, etc., and the redundant information may be checked and deleted before extracting the text prosody feature representation of the target text.
2. And (5) correcting wrongly written characters.
Illustratively, if the target text may contain wrongly written characters, the wrongly written characters need to be detected and then corrected. For example, a sentence in the target text is "这个案子的真想是什么？" ("What is the truth of this case?"), in which "想" is wrongly written: "真想" should be "真相", so "想" in this sentence needs to be changed to "相". Alternatively, a sentence in the target text is "I don't like applets", where "applets" is misspelled and needs to be changed to "apples".
3. And (5) processing part-of-speech tags.
Optionally, the words in the target text are tagged with their parts of speech, where the parts of speech include: nouns, adjectives, verbs, articles, conjunctions, pronouns, adverbs, numerals, prepositions, interjections, and the like.
Schematically, a target text is realized as a Chinese text for example, and word segmentation is performed on the target text to obtain a plurality of words; acquiring a word library marked with parts of speech; and matching the multiple word segments of the target text with the word library marked with the part of speech to obtain a part of speech marking result of the target text.
4. And (5) pre-clause processing.
Optionally, if the text content in the target text includes characters and punctuation marks, schematically, the text content in the target text may be subjected to sentence division processing according to periods, so as to divide the text content in the target text into a plurality of target sentences.
5. And (5) punctuation processing.
Optionally, punctuation marks may be included in the target text, and these punctuation marks may be marked or deleted from the target text.
It should be noted that the above pretreatment method is only an illustrative example, and the present application is not limited thereto.
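A minimal sketch of two of the preprocessing steps above, redundant-information cleanup and pre-clause processing, might look like the following; the regular expressions and the punctuation set are illustrative assumptions rather than a prescribed implementation:

```python
import re

def preprocess(target_text):
    """Remove redundant whitespace and repeated punctuation, then split into sentences."""
    text = re.sub(r"\s+", " ", target_text).strip()      # redundant information: extra spaces
    text = re.sub(r"([。！？!?.])\1+", r"\1", text)        # redundant information: repeated punctuation
    sentences = re.split(r"(?<=[。！？!?.])", text)         # pre-clause processing at sentence-final marks
    return [s.strip() for s in sentences if s.strip()]

print(preprocess("你好！！  欢迎你。。我们提出用自动标注器标注韵律。"))
# -> ['你好！', '欢迎你。', '我们提出用自动标注器标注韵律。']
```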
And step 303, extracting the audio prosody feature representation of the target audio based on the analysis of the sound production content.
Alternatively, the audio prosody feature is represented as a feature representation containing prosody boundary related information of the target audio.
Optionally, the audio prosody feature representation includes a feature representation of the global prosodic-boundary-related information of the target audio and/or a feature representation of the local prosodic-boundary-related information of the target audio, which is not limited in the embodiments of the present application.
Optionally, extracting the audio prosody feature representation of the target audio further comprises: extracting a target feature representation of the target audio; and extracting the audio prosody feature representation of the target audio based on the extracted target feature representation.
Wherein the target feature representation comprises: at least one of time-domain feature representation, frequency-domain feature representation, pitch feature representation, intensity feature representation, duration feature representation, timbre feature representation and the like, which are all used for indicating the vocalization content of the target audio, are provided, and the number and the types of the target feature representations are not limited in the embodiments of the present application.
Illustratively, the target feature representation may be implemented as a frequency domain feature representation and a pitch feature representation of the target audio, and the process of extracting the audio prosody feature representation of the target audio includes: extracting frequency domain feature representation and pitch feature representation of the target audio; and extracting the audio prosody feature representation of the target audio based on the frequency domain feature representation and the pitch feature representation.
It should be noted that the step 302 and the step 303 may be two parallel steps, or may have a sequence, which is not limited in this application, that is, the step 302 may be executed first and then the step 303 is executed, the step 303 may be executed first and then the step 302 is executed, or the step 302 and the step 303 may be executed synchronously.
And step 304, fusing the text prosody feature representation and the audio prosody feature representation to obtain a fused prosody feature representation.
In some optional embodiments, the method for obtaining the fused prosodic feature representation includes at least one of the following methods:
1. and directly connecting the text prosody feature representation and the audio prosody feature representation to obtain a fusion prosody feature representation. The dimension of the fused feature representation after fusion is the sum of the dimension of the text prosody feature representation and the dimension of the audio prosody feature representation.
Illustratively, if the text prosody feature representation is A with dimension a, and the audio prosody feature representation is B with dimension b, then the dimension of the fused prosody feature representation after fusion is a + b.
2. Performing dimension conversion on the audio prosody feature representation by taking the dimension of the text prosody feature representation as a target; and fusing the audio prosody feature representation after the dimension conversion with the text prosody feature representation to obtain a fused prosody feature representation.
Optionally, the audio prosody feature representation is dimension-converted so that it has the same dimension as the text prosody feature representation, and the dimension-converted audio prosody feature representation is fused with the text prosody feature representation to obtain the fused prosody feature representation. Illustratively, if the text prosody feature representation is C with dimension c, and the audio prosody feature representation is D with dimension d, then D is converted to dimension c and fused with C, and the dimension of the fused prosody feature representation is c.
The method for obtaining the fused prosody feature is only an exemplary example, and the embodiment of the present application is not limited thereto.
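Both fusion strategies can be sketched in a few lines of PyTorch; the feature dimensions, the assumption that the audio features are already aligned one-to-one with the characters, and the use of a linear layer for dimension conversion plus element-wise addition in the second strategy are illustrative choices, not the implementation of this application:

```python
import torch
import torch.nn as nn

text_feats = torch.randn(14, 768)    # text prosody feature representation: dimension a = 768 per character
audio_feats = torch.randn(14, 256)   # audio prosody feature representation: dimension b = 256 per character

# Method 1: direct connection (concatenation), fused dimension = a + b
fused_concat = torch.cat([text_feats, audio_feats], dim=-1)    # shape (14, 1024)

# Method 2: convert the audio features to the text dimension, then fuse, fused dimension = a
to_text_dim = nn.Linear(256, 768)                              # dimension conversion
fused_same_dim = text_feats + to_text_dim(audio_feats)         # shape (14, 768)
```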
And 305, performing prosodic boundary prediction on the target text based on the fused prosodic feature representation to obtain a prosodic boundary labeling result with the same length as the target text.
The prosody boundary labeling result comprises prosody boundaries which are divided on the target text by taking characters as granularity.
Optionally, performing prosody boundary prediction on the target text based on the fused prosody feature representation to obtain a prosody boundary labeling sequence, wherein the prosody boundary labeling sequence corresponds to characters in the target text one to one; and aligning the prosodic boundary marking sequence with characters in the target text to obtain a prosodic boundary marking result.
Schematically, please refer to fig. 4, which shows a prosodic boundary labeling result 400 for the text "我们提出用自动标注器标注韵律": the prosodic boundary corresponding to "我" is labeled "CC", indicating a word boundary between "我" and "们"; the prosodic boundary corresponding to "们" is labeled "LW", indicating a grammatical word boundary between "们" and "提"; the prosodic boundary corresponding to "出" is labeled "PW", indicating a prosodic word boundary between "出" and "用"; the prosodic boundary corresponding to "器" is labeled "PPH", indicating a prosodic phrase boundary between "器" and "标"; and the prosodic boundary corresponding to "律" is labeled "IPH", indicating an intonation phrase boundary between "律" and the next character.
It should be noted that word boundaries, grammatical word boundaries, prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries are progressive layer by layer; for example, a position in the target text labeled as a prosodic phrase boundary is necessarily also a prosodic word boundary, a grammatical word boundary, and a word boundary. The target text can therefore be prosodically layered according to the prosodic boundaries in it. Illustratively, referring to fig. 5, which shows a prosodic layering result 500 of the target text of fig. 4: the intonation phrase is "我们提出用自动标注器标注韵律" ("we propose to label prosody with an automatic annotator"); the prosodic phrases are "我们提出用自动标注器" and "标注韵律"; the prosodic words are "我们提出", "用自动标注器", and "标注韵律"; the grammatical words are "我们", "提出", "用", "自动", "标注器", "标注", and "韵律"; and the characters are the individual characters "我", "们", "提", "出", "用", "自", "动", "标", "注", "器", "标", "注", "韵", and "律".
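Conversely, given the per-character labeling sequence of fig. 4, the prosodic layering of fig. 5 can be recovered by cutting the text after every character whose boundary is at least the desired level; the following sketch relies on the layer-by-layer property stated above and uses the reconstructed running example:

```python
RANK = {"CC": 0, "LW": 1, "PW": 2, "PPH": 3, "IPH": 4}

def split_at_level(chars, labels, level):
    """Cut the text after every character whose boundary is at least `level`."""
    segments, current = [], ""
    for ch, lab in zip(chars, labels):
        current += ch
        if RANK[lab] >= RANK[level]:
            segments.append(current)
            current = ""
    if current:
        segments.append(current)
    return segments

text = "我们提出用自动标注器标注韵律"
labels = ["CC", "LW", "CC", "PW", "LW", "CC", "LW",
          "CC", "CC", "PPH", "CC", "LW", "CC", "IPH"]
print(split_at_level(text, labels, "LW"))   # grammatical words
print(split_at_level(text, labels, "PW"))   # prosodic words
print(split_at_level(text, labels, "PPH"))  # prosodic phrases
print(split_at_level(text, labels, "IPH"))  # intonation phrases
```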
In summary, in the prosody boundary labeling method provided in the embodiment of the present application, the text prosody feature representation of the target text and the audio prosody feature representation of the target audio are fused to obtain a fused prosody feature representation, and prosody boundary prediction is performed on the fused prosody feature representation, because the audio prosody feature representation includes prosody boundary information, accuracy of prosody boundary labeling on the target text is improved; and the character is used as the granularity to analyze the target text and predict the prosodic boundary, so that the fine granularity of the prosodic boundary labeling result is improved, and the accuracy of prosodic boundary labeling on the target text is further improved.
According to the prosodic boundary labeling method provided by the embodiment of the present application, prosodic boundary labeling is performed on each character in the target text, and the labeled prosodic boundaries include at least one of a character boundary, a grammatical word boundary, a prosodic word boundary, a prosodic phrase boundary, an intonation phrase boundary, and the like, so the fineness of the prosodic labeling result is improved.
In some alternative embodiments, the text prosodic feature representation is a feature extracted by a pre-trained text coder; and the audio prosody feature representation is a feature extracted by a pre-trained audio encoder. Fig. 6 is a flowchart of a prosodic boundary labeling method according to an exemplary embodiment of the present application, which is described by way of example as being applied to the server 220 shown in fig. 2, and includes:
step 601, acquiring a target text and a target audio.
And matching the text content of the target text with the audio content of the target audio, wherein the target text is a text to be subjected to prosodic boundary identification.
Optionally, the target text is at least one of a chinese text, an english text, and the like, and the type of the language of the target text is not limited in the embodiment of the present application. The target audio is one of mandarin chinese audio, other chinese dialect audio, english pronunciation audio, english american pronunciation audio, and the like, and the embodiment of the present application does not limit the language types of the target text and the target audio.
Schematically, the target text is implemented as a Chinese text and the target audio as Mandarin audio. The server acquires the target text through the terminal, and acquires audio data, recorded by the terminal, of a speaker reading the target text in Mandarin, thereby obtaining the target audio. Alternatively, the server acquires the target audio from a Mandarin audio data set; the content of the target audio is converted into Chinese text data by manual transcription, and after the Chinese text data is input to the terminal, the terminal uploads it to the server as the target text.
Step 602, performing character segmentation on the target text to obtain a plurality of character data in the target text.
Optionally, a character refers to the smallest unit that can be segmented from the target text from a semantic perspective; for example, the Chinese characters in a Chinese text and the words in an English text are independent units and are the smallest units that make up a sentence.
Illustratively, if the target text is implemented as a Chinese text, character segmentation is performed on the target text to obtain the set of Chinese characters in the target text. For example, the Chinese text "我们提出用自动标注器标注韵律" ("we propose to label prosody with an automatic annotator") is cut into a set of 14 Chinese characters: "我", "们", "提", "出", "用", "自", "动", "标", "注", "器", "标", "注", "韵", and "律".
Illustratively, if the target text is implemented as an English text, the target text is segmented into words to obtain the set of words in the target text. For example, the English text "We propose to label prosody with automatic annotator" is divided, according to the spaces between words, into a set of 8 words: "We", "propose", "to", "label", "prosody", "with", "automatic", and "annotator".
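A minimal sketch of the character segmentation in step 602, under the assumption that a Chinese text is split into individual Chinese characters and an English text into whitespace-separated words:

```python
def segment(target_text, language):
    if language == "zh":
        # Chinese: every character is an analysis unit (punctuation handling is left open here)
        return [ch for ch in target_text if not ch.isspace()]
    # English: split on whitespace so each word is an analysis unit
    return target_text.split()

print(segment("我们提出用自动标注器标注韵律", "zh"))                              # 14 characters
print(segment("We propose to label prosody with automatic annotator", "en"))  # 8 words
```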
Step 603, extracting word vectors corresponding to the plurality of character data respectively.
Optionally, the word vector refers to an original vector representation corresponding to each character in the character data.
Schematically, a target text is realized as a Chinese text for explanation, and an original vector representation corresponding to each Chinese character in the Chinese text is extracted; alternatively, each Chinese character in the target text is converted into a one-dimensional vector, i.e., a word vector, by querying the word vector table.
Step 604, inputting the word vector into a text encoder, and outputting the text prosody feature representation of the target text.
Optionally, the text prosodic feature representation includes: at least one of a text vector, a position vector, and the like corresponding to the character data.
Schematically, taking the target text as a Chinese text as an example, the word vector is input into the text encoder, and a text vector corresponding to the word vector is obtained, where the text vector contains the global semantic information of the Chinese text and the semantic information of the Chinese character corresponding to the word vector, and is used to indicate the specific semantics of that Chinese character in the Chinese text; a position vector corresponding to the Chinese character is obtained based on the Chinese text, where the position vector is used to indicate the position of the Chinese character corresponding to the word vector in the Chinese text; finally, the sum of the word vector, the text vector, and the position vector is used as the character prosody feature representation corresponding to the Chinese character. The set of character prosody feature representations corresponding to the Chinese characters in the Chinese text is output as the text prosody feature representation of the Chinese text, or the character prosody feature representations corresponding to the Chinese characters are spliced as the text prosody feature representation of the Chinese text.
Optionally, the character prosody feature representations of the Chinese characters in the Chinese text all have the same length.
In some optional embodiments, the word prosodic feature representation may be implemented as a weighted word prosodic feature representation, and optionally, the weighting method of the word prosodic feature representation includes: inquiring a word vector weight table based on the word vector corresponding to the character; acquiring the weight corresponding to the word vector; and performing weighting processing on the character prosody feature representation corresponding to the character vector based on the weight to obtain weighted character prosody feature representation. The word vector weight table is used for indicating the importance of characters corresponding to the word vectors in the target text.
It should be noted that the text encoder is an encoder pre-trained by a corpus of text.
Illustratively, the text encoder includes: at least one of a Long Short-Term Memory (LSTM) network, a Bidirectional Encoder Representations from Transformers (BERT) model, a Generative Pre-trained Transformer (GPT) model, a RoBERTa model, etc., which is not limited in the embodiments of the present application. The following description takes a text encoder implemented as the BERT model as an example.
Illustratively, the BERT model is a pre-trained text encoder; if the target text is implemented as a Chinese text, the pre-training data set of the BERT model is a Chinese corpus. Optionally, the initial value parameters of the text vector are determined based on the pre-training process of the BERT model.
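As a hedged illustration of steps 603-604, the sketch below embeds character ids, adds text (segment) and position vectors, and runs the sum through a small Transformer-style encoder so that each character receives one prosody feature vector of fixed length. The vocabulary size, hidden dimension, head count, and layer count are assumptions for illustration, not values disclosed in this application.

```python
import torch
import torch.nn as nn

class TextProsodyEncoder(nn.Module):
    """Toy text encoder: word vector + text (segment) vector + position vector."""
    def __init__(self, vocab_size=21128, hidden_dim=768, max_len=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_dim)   # word vector
        self.text_emb = nn.Embedding(2, hidden_dim)            # text (segment) vector
        self.pos_emb = nn.Embedding(max_len, hidden_dim)       # position vector
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, nhead=12, batch_first=True),
            num_layers=2,
        )

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        n = char_ids.size(1)
        pos = torch.arange(n, device=char_ids.device).unsqueeze(0)
        seg = torch.zeros_like(char_ids)
        x = self.word_emb(char_ids) + self.text_emb(seg) + self.pos_emb(pos)
        return self.encoder(x)                                  # one vector per character

# Usage: a batch containing one sentence of 14 character ids
features = TextProsodyEncoder()(torch.randint(0, 21128, (1, 14)))
print(features.shape)  # torch.Size([1, 14, 768])
```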
Step 605, extracting the frequency domain feature representation and pitch feature representation of the target audio.
The frequency domain feature representation and the pitch feature representation are used to indicate the voicing content of the target audio.
Illustratively, the frequency domain feature representation includes: at least one of a Filter bank (Fbank) characteristic, a Mel Frequency Cepstrum Coefficient (MFCC) characteristic, etc., which is not limited in the embodiments of the present application.
Schematically, the specific process of extracting the frequency domain feature representation is as follows:
optionally, performing framing processing on the target audio to obtain a plurality of time frames; respectively extracting corresponding sub-frequency domain feature representations of each time frame in the target audio; and splicing the sub-frequency domain feature representations corresponding to all time frames in the target audio to obtain the frequency domain feature representation of the target audio, or taking the set of the sub-frequency domain feature representations as the frequency domain feature representation of the target audio.
Illustratively, the specific process of extracting pitch feature representation is as follows:
optionally, performing framing processing on the target audio to obtain a plurality of time frames; extracting corresponding sub-pitch characteristic representation of each time frame in the target audio respectively; and splicing the sub-pitch characteristic representations corresponding to all time frames in the target audio to obtain the pitch characteristic representation of the target audio, or taking the set of the sub-pitch characteristic representations as the pitch characteristic representation of the target audio.
And 606, splicing the frequency domain feature representation and the pitch feature representation to obtain target feature representation.
Optionally, the manner of obtaining the target feature representation includes at least one of the following manners:
1. carrying out weighted summation on the frequency domain feature representation and the pitch feature representation to obtain target feature representation;
2. acquiring a product of the frequency domain feature representation and the pitch feature representation as a target feature representation;
3. acquiring Cartesian products of frequency domain feature representation and pitch feature representation as target feature representation;
4. the set of frequency domain feature representations and pitch feature representations are taken as target feature representations.
It should be noted that the above method for obtaining the target feature is only an illustrative example, and the embodiment of the present application is not limited thereto.
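A hedged sketch of steps 605-606, assuming librosa is available: the target audio is framed, a per-frame log-mel (Fbank-style) frequency-domain representation and a per-frame pitch (f0) representation are extracted, and the two are spliced frame by frame to form the target feature representation. All parameter values are illustrative.

```python
import numpy as np
import librosa

def extract_target_features(wav_path: str, sr: int = 16000,
                            n_mels: int = 80, hop: int = 160, win: int = 400):
    y, _ = librosa.load(wav_path, sr=sr)
    # sub-frequency-domain feature per time frame: log-mel filter bank (Fbank-style)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=win,
                                         hop_length=hop, n_mels=n_mels)
    fbank = np.log(mel + 1e-6).T                        # (frames, n_mels)
    # sub-pitch feature per time frame: fundamental frequency estimate
    f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr,
                     frame_length=win, hop_length=hop)   # (frames,)
    n = min(len(fbank), len(f0))
    # splice the frequency-domain and pitch representations frame by frame
    return np.concatenate([fbank[:n], f0[:n, None]], axis=-1)   # (frames, n_mels + 1)
```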
And step 607, inputting the target feature representation into an audio encoder, and outputting to obtain the audio prosody feature representation of the target audio.
Wherein the audio encoder is an encoder pre-trained by a speech data set.
Optionally, the method for obtaining the audio prosody feature representation of the target audio includes at least one of the following methods:
1. inputting the target feature representation into an audio encoder to obtain a first voice posterior probability graph, wherein the first voice posterior probability graph is used for indicating the phoneme level posterior probability of the target audio; and outputting the audio prosody feature representation of the target audio based on the first voice posterior probability graph.
Optionally, the first speech posterior probability map is output as an audio prosody feature representation of the target audio.
Illustratively, the audio encoder may be implemented as a phoneme level-based speech posterior probability map extractor, and the target feature representation is input into the phoneme level-based speech posterior probability map extractor, and the first speech posterior probability map is output as the audio prosodic feature representation of the target audio.
Wherein in the first speech posterior probability map, the abscissa is used to indicate the time line of the target audio, the ordinate is used to indicate the class of phonemes, each coordinate point in the map is used to indicate the posterior probability size of the phoneme of the class occurring at a given point in time, and the darker the color at each coordinate point, the greater the probability.
Illustratively, if the target audio is implemented as mandarin chinese audio, the phone-level-based speech posterior probability map extractor pre-trains on the mandarin chinese speech data set, and the pre-training process of the phone-level-based speech posterior probability map extractor is explained as follows:
The pre-training targets are 218 context-independent frame-level phonemes.
Schematically, the speech in the Mandarin speech data set is input into a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) to obtain the phonemes corresponding to the Mandarin speech data set; the phonemes are used as training data, and cross entropy is used as the loss function to train the phoneme-level-based speech posterior probability map extractor.
2. Inputting the target feature representation into an audio encoder to obtain a second voice posterior probability graph, wherein the second voice posterior probability graph is used for indicating the word-level posterior probability of the target audio; and outputting the audio prosody characteristic representation of the target audio based on the second voice posterior probability graph.
The audio encoder may be implemented as a word-level-based speech recognition model, and optionally, the target feature representation is input into the word-level-based speech recognition model and output to obtain the second speech posterior probability map as the audio prosody feature representation of the target audio.
Schematically, the target text is implemented as a Chinese text, wherein in the second speech posterior probability graph, the abscissa is used for indicating the time line of the target audio, the ordinate is used for indicating the type of the Chinese character, each coordinate point in the graph is used for indicating the posterior probability size of the Chinese character appearing at a given time point, and the deeper the color at each coordinate point, the greater the probability.
In some optional embodiments, the dividing the target audio into a plurality of audio segments for analysis, respectively, where the target feature representation includes a segment feature representation corresponding to the target audio segment, and the obtaining the second speech posterior probability map further includes:
inputting the segment feature representation into an audio encoder to obtain a posterior probability subgraph corresponding to the segment feature representation; and integrating the posterior probability subgraphs respectively corresponding to the plurality of audio clips to obtain a second voice posterior probability graph.
Illustratively, if the target audio is implemented as Mandarin audio, the word-level-based speech recognition model is pre-trained on a Mandarin speech data set, optionally in an end-to-end manner with Connectionist Temporal Classification (CTC) as the loss function. The pre-training process is as follows:
The speech in the Mandarin speech data set is input into the word-level-based speech recognition model to obtain prediction data; the CTC loss is obtained based on the prediction data and pre-acquired ground-truth data; and the word-level-based speech recognition model is trained based on the CTC loss.
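Below is a hedged sketch of the two pre-training objectives just described: frame-level cross entropy against phoneme alignments for the phoneme-level extractor of method 1, and CTC against word sequences for the word-level recognizer of method 2. The linear stand-in models, feature dimension, and vocabulary size are placeholders for illustration only.

```python
import torch
import torch.nn as nn

feat_dim, num_phones, vocab_size = 83, 218, 5000   # vocab includes the CTC blank at index 0

# --- method 1: frame-level cross entropy on phoneme targets (e.g. GMM-HMM alignments) ---
ppg_extractor = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                              nn.Linear(256, num_phones))
frames = torch.randn(32, feat_dim)                       # a batch of frame features
phone_targets = torch.randint(0, num_phones, (32,))      # aligned frame-level phoneme ids
ce_loss = nn.CrossEntropyLoss()(ppg_extractor(frames), phone_targets)
ce_loss.backward()

# --- method 2: sequence-level CTC on word targets -------------------------------------
asr_model = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                          nn.Linear(256, vocab_size))
utterance = torch.randn(300, 1, feat_dim)                # (time, batch, feature)
log_probs = asr_model(utterance).log_softmax(dim=-1)     # per-frame word posteriors
targets = torch.randint(1, vocab_size, (1, 20))          # reference word ids
ctc_loss = nn.CTCLoss(blank=0)(log_probs, targets,
                               torch.tensor([300]), torch.tensor([20]))
ctc_loss.backward()
```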
Step 608, the text prosody feature representation and the audio prosody feature representation are fused to obtain a fused prosody feature representation.
Optionally, performing dimension conversion on the audio prosody feature representation by an attention mechanism to make the audio prosody feature representation have the same dimension as that of the text prosody feature representation, and fusing the audio prosody feature representation after the dimension conversion and the text prosody feature representation to obtain a fused prosody feature representation.
Illustratively, the text prosody feature representation and the audio prosody feature representation are implemented as a text prosody feature vector and an audio prosody feature vector, the text prosody feature vector and the audio prosody feature vector are input into the multi-modal fusion model to obtain a fusion prosody feature vector, and the following describes a process of obtaining the fusion prosody feature vector:
Referring to fig. 7, the multi-modal fusion model 700 includes a first network layer 710 and a second network layer 720, where the first network layer 710 includes a multi-head self-attention layer, a first forward propagation layer, and a first linear layer, and the second network layer 720 includes a multi-head cross attention layer.
First, the audio prosody feature vector 711 is input into the first network layer 710, which outputs an audio prosody feature vector 712 with the same dimension as the text prosody feature vector 721. The audio prosody feature vector 712 and the text prosody feature vector 721 are then input into the second network layer 720 and fused in the multi-head cross attention layer, where the text prosody feature vector 721 serves as the query and the dimension-converted audio prosody feature vector 712 serves as the key and value. The specific calculation formulas are as follows:
the formula I is as follows: q x ,K o ,V o =W Q X,W k O,W v O
The formula II is as follows:
Figure BDA0003652285480000211
in the formula I, O is ═ O 1 ,…,o T ]∈R T×D And X ═ X 1 ,…,x T ]∈R N×D Audio and text-side inputs representing a multi-headed cross attention layer, respectively; q x ,K o ,V o Respectively representing query input of a text end, key input of an audio end and value input of the audio end; w Q ,W k ,W v Respectively represent Q x ,K o ,V o A trainable matrix of (a).
In the formula II, H is belonged to R N×D Is a multiple headThe output of the cross attention layer, namely the fused prosodic feature vector; d represents the dimension of the fused prosodic feature vector; softmax is the activation function.
Finally, outputting the obtained H epsilon R N×D As the fused prosodic feature vector 722.
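A minimal single-head sketch of the cross-attention fusion in formulas I and II, where the text prosody features act as the query and the dimension-converted audio prosody features act as the key and value; the described layer is multi-head, and the sizes here are illustrative.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W_k
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_v
        self.scale = dim ** 0.5

    def forward(self, x: torch.Tensor, o: torch.Tensor) -> torch.Tensor:
        q, k, v = self.w_q(x), self.w_k(o), self.w_v(o)              # formula I
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ v                                               # formula II: H, shape (N, D)

x = torch.randn(14, 768)    # text prosody features, one row per character (N = 14)
o = torch.randn(600, 768)   # audio prosody features after dimension conversion (T = 600)
fused = CrossAttentionFusion(768)(x, o)
print(fused.shape)          # torch.Size([14, 768])
```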
And step 609, performing prosodic boundary prediction on the target text based on the fused prosodic feature expression, and determining a prosodic boundary type corresponding to the characters in the target text.
The prosodic boundary type includes at least one of a character (CC) boundary, a grammatical word (LW) boundary, a Prosodic Word (PW) boundary, a Prosodic Phrase (PPH) boundary, and an Intonation Phrase (IPH) boundary, which is not limited in this application.
And step 610, marking characters in the target text by the prosodic boundary type to obtain a prosodic boundary marking result with the same length as the target text.
The prosody boundary labeling result comprises prosody boundaries which are divided on the target text by taking characters as granularity.
Optionally, the characters in the target text are labeled by the prosody boundary type to obtain a prosody boundary labeling sequence, wherein the prosody boundary labeling sequence corresponds to the characters in the target text one by one; and aligning the prosodic boundary marking sequence with characters in the target text to obtain a prosodic boundary marking result.
Illustratively, referring to fig. 4, if the prosodic boundary corresponding to the first character is labeled "CC", it indicates that the type of prosodic boundary between that character and the following character is a character boundary.
In an optional embodiment, the method for obtaining a prosodic boundary labeling result with the same length as the target text further includes:
performing prosodic boundary prediction on the target text based on text prosodic feature representation to obtain a prosodic boundary text feature prediction result with the same length as the target text; performing prosodic boundary prediction on the target text based on the audio prosodic feature representation to obtain a prosodic boundary audio feature prediction result with the same length as the target text; and obtaining a prosodic boundary labeling result with the same length as the target text based on the prosodic boundary text feature prediction result and the prosodic boundary audio feature prediction result.
Schematically, the prosodic boundary text feature prediction result obtained from the text prosody feature representation and the prosodic boundary audio feature prediction result obtained from the audio prosody feature representation are acquired respectively, and the result with the higher prosodic naturalness is selected as the final prosodic boundary labeling result.
In an optional embodiment, the method for obtaining a prosodic boundary annotation result with the same length as the target text further includes:
acquiring a target text and a target audio; in response to the fact that the number of characters in the target text is smaller than or equal to a preset character threshold value, extracting text prosody feature representation of the target text; and performing prosodic boundary prediction on the target text based on the text prosodic feature representation to obtain a prosodic boundary prediction result with the same length as the target text. Or, in response to the number of characters in the target text being greater than a preset character threshold, extracting text prosody feature representation of the target text and extracting audio prosody feature representation of the target audio; fusing the text prosody feature representation and the audio prosody feature representation to obtain fused prosody feature representation; and performing prosodic boundary prediction on the target text based on the fusion prosodic feature expression to obtain a prosodic boundary labeling result with the same length as the target text.
That is to say, before prosodic boundary prediction is performed on a text, the number of characters in the text is judged. If the number of characters is small, i.e., less than or equal to the preset character threshold, the prosodic boundaries in the text are relatively simple, and the prosodic boundary labeling result can be obtained directly by semantic analysis of the text; if the number of characters in the text is large, i.e., greater than the preset character threshold, the prosodic boundary labeling result is obtained by jointly analyzing the semantic information of the text and the prosodic-boundary-related information in the audio.
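The branching just described can be sketched as follows; the threshold value and the helper functions are hypothetical placeholders.

```python
CHAR_THRESHOLD = 10  # illustrative preset character threshold

def label_prosody_boundaries(target_text, target_audio,
                             extract_text_feat, extract_audio_feat,
                             fuse, predict_boundaries):
    text_feat = extract_text_feat(target_text)
    if len(target_text) <= CHAR_THRESHOLD:
        # few characters: semantic analysis of the text alone is sufficient
        return predict_boundaries(text_feat)
    # many characters: jointly analyze text semantics and audio prosody cues
    audio_feat = extract_audio_feat(target_audio)
    return predict_boundaries(fuse(text_feat, audio_feat))
```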
In summary, in the prosody boundary labeling method provided in the embodiment of the present application, the text prosody feature representation of the target text and the audio prosody feature representation of the target audio are fused to obtain a fused prosody feature representation, and prosody boundary prediction is performed on the fused prosody feature representation, because the audio prosody feature representation includes prosody boundary information, accuracy of prosody boundary labeling on the target text is improved; and the character is used as the granularity to analyze the target text and predict the prosodic boundary, so that the fine granularity of the prosodic boundary labeling result is improved, and the accuracy of prosodic boundary labeling on the target text is further improved.
According to the prosody boundary labeling method provided by the embodiment of the application, prosody boundary labeling is performed on each character in the target text, and the labeled prosody boundary comprises at least one of a character boundary, a grammar word boundary, a prosody phrase boundary, a intonation phrase boundary and the like, so that the fine granularity of a prosody labeling result is improved.
According to the prosody boundary labeling method provided by the embodiment of the application, the text prosody feature representation of the target text and the audio prosody feature representation of the target audio are respectively extracted through the pre-trained text encoder and the pre-trained audio encoder, so that the accuracy and the generalization of the prosody labeling method are improved.
Fig. 8 is a flowchart of a prosodic boundary labeling method according to an exemplary embodiment of the present application, which is described by way of example as being applied to the server 220 shown in fig. 2, and includes:
step 801, obtaining candidate texts and a preset sentence library.
The candidate text comprises a plurality of target sentences, and the preset sentence library comprises a plurality of preset sentences.
Illustratively, the candidate text may be a long text that includes a plurality of complete sentences; the preset sentences in the preset sentence library include commonly used sentences, sentences with fixed everyday pronunciation rhythm (i.e., sentences whose prosodic boundary labels are basically unchanged), and the like.
Optionally, obtaining the candidate text further includes performing sentence segmentation on the candidate text. Illustratively, there may be obvious segmentation marks (for example, punctuation such as periods, commas, and question marks) between the plurality of target sentences in the candidate text, and the candidate text is segmented into a plurality of sentences according to these segmentation marks.
Optionally, when the candidate text is segmented into sentences, the target sentences are numbered sequentially according to their positions in the candidate text.
Step 802, matching the multiple target sentences with preset sentences in a preset sentence library respectively to obtain multiple sentence matching results.
Illustratively, the obtained target sentences are respectively matched with the preset sentences in the preset sentence library. If a target sentence is "Hello!" and the preset sentence library stores the sentence "Hello!" together with its prosody labeling result, the target sentence is successfully matched. Optionally, a target sentence is successfully matched only when all of its characters and punctuation marks correspond one-to-one with the characters and punctuation marks of a preset sentence in the preset sentence library; otherwise, the matching fails. After all the target sentences have been matched, the sentence matching results are divided into target sentences that matched successfully and target sentences that failed to match.
And 803, screening the multiple sentence matching results to obtain a target sentence which is failed to match with the preset sentence in the preset sentence library in the candidate text and serves as the target text.
The target text is a text to be subjected to prosodic boundary recognition.
Illustratively, a plurality of sentence matching results are screened, all the target sentences which fail to be matched are used as screened sentence matching results, and the target sentences which fail to be matched are recombined according to the numbering sequence to obtain the target text.
Optionally, the target sentence which is successfully matched is stored in the temporary storage space, and when the prosody boundary labeling process of the target sentence which is not successfully matched in the target text is finished, the target sentence which is successfully matched and the target sentence with the prosody boundary labeled are spliced according to the sentence numbering sequence.
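A hedged sketch of steps 801-803: the candidate text is split into numbered sentences, each sentence is matched against the preset sentence library, and only the sentences that fail to match are kept as the target text. The punctuation set used for sentence segmentation and the library layout are assumptions.

```python
import re

def select_target_sentences(candidate_text: str, preset_library: dict) -> list[tuple[int, str]]:
    # split on common sentence-ending punctuation while keeping the marks attached
    sentences = [s for s in re.split(r"(?<=[。！？!?.])", candidate_text) if s.strip()]
    failed = []
    for idx, sent in enumerate(sentences):          # number sentences in order of appearance
        if sent in preset_library:                  # exact character/punctuation match
            continue                                # matched: reuse the stored prosody labels
        failed.append((idx, sent))                  # failed: needs prosodic boundary prediction
    return failed
```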
And step 804, acquiring the audio matched with the target text as the target audio.
Illustratively, the audio data corresponding to the target sentence with failed matching is extracted as the target audio.
Step 805, acquiring a preset phrase library.
The preset phrase library comprises a plurality of preset phrases.
Illustratively, the predetermined phrase includes a fixed prosodic word, a common prosodic word, a special prosodic word, and the like.
Step 806, matching the target text with a preset phrase in a preset phrase library to obtain a reference matching result.
Illustratively, the target text is matched with fixed prosodic words, common prosodic words, special prosodic words and the like in a preset phrase library to obtain the target text with the reference prosodic boundary marked as a reference matching result.
Step 807, extracting text prosody feature representation of the target text by using the characters as analysis granularity.
Optionally, performing character segmentation on the target text to obtain a plurality of character data in the target text; extracting word vectors corresponding to the plurality of character data respectively; and inputting the word vector into a text encoder, and outputting text prosody feature representation of the target text. The text encoder is obtained by pre-training a text corpus.
And 808, extracting the audio prosody feature expression of the target audio based on the analysis of the sound production content.
Optionally, extracting a frequency domain feature representation and a pitch feature representation of the target audio; splicing the frequency domain feature representation and the pitch feature representation to obtain target feature representation; and inputting the target feature representation into an audio encoder, and outputting to obtain the audio prosody feature representation of the target audio. Wherein the audio encoder is an encoder pre-trained by a speech data set.
Optionally, the audio encoder is a phoneme level based encoder or the audio encoder is a word level based encoder.
And step 809, fusing the text prosody feature representation and the audio prosody feature representation to obtain a fused prosody feature representation.
Optionally, performing dimension conversion on the audio prosody feature representation by an attention mechanism to make the audio prosody feature representation have the same dimension as that of the text prosody feature representation, and fusing the audio prosody feature representation after the dimension conversion and the text prosody feature representation to obtain a fused prosody feature representation.
And 810, performing prosodic boundary prediction on the target text based on the fused prosodic feature expression to obtain a prediction result.
Optionally, performing prosody boundary prediction on the target text based on the fused prosody feature expression, and determining a prosody boundary type corresponding to characters in the target text; and marking characters in the target text by a prosodic boundary type to obtain a prediction result with the same length as the target text.
And 811, obtaining a prosodic boundary labeling result with the same length as the target text based on the prediction result and the reference matching result.
The prosody boundary labeling result comprises prosody boundaries which are divided on the target text by taking characters as granularity.
In some optional embodiments, the method for obtaining a prosodic boundary annotation result with the same length as the target text includes:
1. and adjusting the prediction result based on the reference matching result to obtain a prosodic boundary labeling result with the same length as the target text.
In an optional embodiment, the reference matching result includes a prosody boundary reference labeling sequence, and the prediction result includes a prosody boundary prediction labeling sequence; comparing the prosody boundary reference marking sequence with the prosody boundary prediction marking sequence to obtain a prosody boundary comparison result, wherein the prosody boundary comparison result is used for indicating the difference between the reference matching result and the prediction result; and determining a prosodic boundary labeling result with the same length as the target text based on the prosodic boundary comparison result.
Optionally, the preset phrases in the preset phrase library correspond to weights, and the prosodic boundary reference labeling sequence includes a first prosodic boundary identifier; then, obtaining a prosodic boundary labeling result with the same length as the target text based on the prosodic boundary comparison result includes:
in response to the first prosodic boundary identifier not existing in the prosodic boundary prediction annotation sequence, acquiring a phrase indicated by the first prosodic boundary identifier in the target text; matching the phrases with a preset phrase library, and acquiring the weight matched with the phrases; and updating the prosodic boundary prediction labeling sequence to obtain a prosodic boundary labeling result in response to the weight matched with the phrase reaching a preset weight threshold.
Schematically, comparing the difference between the prediction result and the reference matching result on the prosody boundary labeling sequence; if the prosodic boundary identifier of the reference matching result is different from the prediction result at a certain position of the target text, acquiring a phrase indicated by the prosodic boundary identifier at the position; and inquiring the corresponding weight of the phrase in a preset phrase library, and if the weight reaches a preset weight threshold, replacing the prosodic boundary identifier of the prediction result at the position with the prosodic boundary identifier of the reference matching result at the position.
2. And selecting the reference matching result or the prediction result as a prosodic boundary labeling result.
Schematically, the reference matching result and the prediction result are scored; if the score of the reference matching result is higher than that of the prediction result, the reference matching result is selected as the prosodic boundary labeling result; if the score of the reference matching result is lower than that of the prediction result, the prediction result is selected as the prosodic boundary labeling result.
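A minimal sketch of the weight-based adjustment described in point 1 above: where the reference labeling sequence and the predicted labeling sequence disagree, the predicted identifier is replaced only if the phrase indicated by the reference identifier carries enough weight in the preset phrase library. The function names and the threshold are illustrative.

```python
def adjust_prediction(pred_seq, ref_seq, phrase_at, phrase_weights, threshold=0.8):
    result = list(pred_seq)
    for pos, (p, r) in enumerate(zip(pred_seq, ref_seq)):
        if p == r or r is None:
            continue                                 # identifiers agree (or no reference): keep prediction
        phrase = phrase_at(pos)                      # phrase indicated by the reference identifier
        if phrase_weights.get(phrase, 0.0) >= threshold:
            result[pos] = r                          # weight reaches the threshold: trust the reference
    return result
```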
In summary, in the prosody boundary labeling method provided in the embodiment of the present application, the text prosody feature representation of the target text and the audio prosody feature representation of the target audio are fused to obtain a fused prosody feature representation, and prosody boundary prediction is performed on the fused prosody feature representation, because the audio prosody feature representation includes prosody boundary information, accuracy of prosody boundary labeling on the target text is improved; and the character is used as the granularity to analyze the target text and predict the prosodic boundary, so that the fine granularity of the prosodic boundary labeling result is improved, and the accuracy of prosodic boundary labeling on the target text is further improved.
According to the prosody boundary labeling method provided by the embodiment of the application, prosody boundary labeling is carried out on each character in the target text, and the labeled prosody boundary comprises at least one of a character boundary, a grammar word boundary, a prosody phrase boundary, a intonation phrase boundary and the like, so that the fine granularity of a prosody labeling result is improved.
According to the prosody boundary labeling method provided by the embodiment of the application, the text prosody feature representation of the target text and the audio prosody feature representation of the target audio are respectively extracted through the pre-trained text encoder and the pre-trained audio encoder, so that the accuracy and the generalization of the prosody labeling method are improved.
According to the prosody boundary labeling method provided by the embodiment of the application, before prosody boundary prediction is performed on the target text, the sentences which are failed to be matched with the preset sentences in the preset sentence library in the target text are screened out and analyzed, so that the calculated amount of the prosody boundary labeling method is reduced; and the prosodic boundary labeling result is obtained by comparing the reference matching result with the prediction result, so that the accuracy of prosodic boundary labeling on the target text is improved.
Fig. 9 is a prosodic boundary labeling model according to an exemplary embodiment of the present application, please refer to fig. 9, in which the prosodic boundary labeling model 900 includes a text encoder 910, an audio encoder 920, and a multi-mode fusion decoder 930, and is configured to analyze an input text and an input audio to obtain a prosodic boundary labeling result with a length equal to that of the text. The following describes specific implementation steps of obtaining a prosodic boundary annotation result through the prosodic boundary annotation model 900, as shown in fig. 9:
S1: the server obtains the Chinese text 911 "we propose to label prosody with an automatic labeler" input at the terminal, and obtains the Mandarin audio 921, recorded through the terminal, of a speaker reading "we propose to label prosody with an automatic labeler".
S2: dividing the Chinese text 911 into text data consisting of each Chinese character, and acquiring an original character vector of the text data consisting of each Chinese character; inputting a word vector set 912 corresponding to all Chinese characters in a Chinese text 911 into a text encoder 910, wherein the text encoder 910 is implemented as a pre-trained Chinese BERT encoder, and the text encoder 910 performs pre-training in a 300GB news corpus; the fixed-length text prosody feature vector 913 containing the context features is obtained through output.
Illustratively, the specific process of extracting the text prosody feature vector 913 through the pre-trained chinese BERT encoder is as follows: inputting the word vector set 912 into the text encoder 910, and acquiring a text vector set corresponding to the word vector set 912, wherein the text vector set includes global semantic information of the chinese text 911 and semantic information of the chinese character corresponding to the word vector set 912, and is used to indicate the specific semantics of the chinese character corresponding to the word vector set 912 in the chinese text 911; and acquiring a position vector set corresponding to the Chinese character based on the Chinese text 911, wherein the position vector set is used for indicating position information of the Chinese character corresponding to the character vector set 912 in the Chinese text 911; finally, the sum of the character vector, the text vector and the position vector corresponding to each Chinese character is respectively used as a character rhythm feature vector corresponding to the Chinese character, and the length of the character rhythm feature vector of each Chinese character is fixed; the set of word prosodic feature vectors corresponding to each Chinese character in the Chinese text 911 is output as the text prosodic feature vector 913 for the Chinese text 911.
S3: dividing the mandarin audio 921 into a plurality of time frames, acquiring 80-dimensional FBank characteristics and 3-dimensional pitch characteristics of each time frame, and splicing the 80-dimensional FBank characteristics and the 3-dimensional pitch characteristics of each time frame to obtain time frame input characteristics; the time-frame input feature set 922 corresponding to all the time frames in the mandarin chinese audio 921 is input to the audio encoder 920.
The audio encoder includes: a speech posterior probability map extractor based on a convolution-augmented Transformer (Conformer) structure, a speech recognition model based on a Convolutional Neural Network (CNN) structure, and a speech recognition model based on a Conformer structure. The following describes the process of extracting the audio prosody feature vector 923 of the Mandarin audio 921 when each of these three models is implemented as the audio encoder:
(1) The audio encoder 920 is implemented as a Conformer-structure-based speech posterior probability map extractor.
It should be noted that the Conformer-structure-based speech posterior probability map extractor is a speaker-independent frame-level classifier that maps each input time frame to the posterior probabilities of phoneme classes; the extracted speech posterior probability map can represent the duration and transition information of each phoneme in the audio. The model structure of the Conformer-structure-based speech posterior probability map extractor consists of 2 convolutional layers and 12 Conformer modules.
Illustratively, the set 922 of time-frame input features corresponding to all time frames in the Mandarin audio 921 is input into the audio encoder 920, and the time-frame input feature corresponding to each time frame is mapped, through the 2 convolutional layers and 12 Conformer modules, to a time-frame posterior probability map over phoneme categories, where the abscissa of the time-frame posterior probability map indicates the time line of each time frame, the ordinate indicates the phoneme category, each coordinate point in the map indicates the posterior probability of that category of phoneme occurring at the given time point, and the darker the color at a coordinate point, the greater the probability; the set of time-frame posterior probability maps corresponding to all time frames in the Mandarin audio 921 is obtained as the audio prosody feature vector 923 of the Mandarin audio 921.
It is noted that the Conformer-based speech posterior probability map extractor is an audio encoder pre-trained on the 10k-hour WenetSpeech data set. The pre-training process of the Conformer-based speech posterior probability map extractor is described as follows:
The pre-training targets are 218 context-independent frame-level phonemes.
Schematically, the 10k hours of speech in the WenetSpeech data set are input into a GMM-HMM model to obtain the phonemes corresponding to the 10k hours of the WenetSpeech data set; the phonemes are used as training data, and cross entropy is used as the loss function to train the Conformer-structure-based speech posterior probability map extractor.
However, the above speech posterior probability map does not take word-level context features into account, yet information at the character and word level is important for the prediction of prosodic boundaries. For example, the sequences "university biology, compulsory lesson" and "university student, compulsory lesson" have the same phoneme sequence but different prosodic boundaries. Therefore, a speech posterior probability map that considers only phoneme information may not achieve the optimal prediction effect. Thus, two word-level speech recognition models, based on the CNN and Conformer structures respectively, are provided for prosodic boundary prediction.
(2) The audio encoder 920 is implemented as a speech recognition model based on the CNN structure.
It should be noted that the CNN-based speech recognition model maps each input time frame to the posterior probability of the word class, which retains the information at the word level, and focuses on the local information, which is composed of 2 convolutional layers and 1 linear layer.
Illustratively, inputting the time-frame input feature set 922 corresponding to all time frames in the mandarin chinese audio 921 into the audio encoder 920, the time-frame input feature corresponding to each time frame would be mapped to a word-level time-frame posterior probability map through 2 convolutional layers and 1 linear layer, wherein the abscissa of the time-frame posterior probability map is used to indicate the time line of each time frame, the ordinate is used to indicate the category of the word, each coordinate point in the map is used to indicate the posterior probability size of the word occurring at a given time point, and the darker the color at each coordinate point, the greater the probability; the set of vectors of the 512-dimensional hidden layer at the penultimate layer in the speech recognition model based on the CNN structure is output as the audio prosodic feature vector 923 of the mandarin audio 921.
It is noted that the CNN-based speech recognition model is an audio encoder pre-trained on a 10 k-hour WenetSpeech data set, and the following describes the pre-training process of the CNN-based speech recognition model:
inputting the voice in the WenetSpeech data set of 10k hours into a voice recognition model based on a CNN structure to obtain prediction data; obtaining CTC loss based on the predicted data and pre-acquired real data; and training a voice recognition model based on the CNN structure based on the CTC loss.
(3) The audio encoder 920 is implemented as a Conformer-structure-based speech recognition model.
It should be noted that the Conformer-structure-based speech recognition model maps each input time frame to the posterior probabilities of word classes, retains information at the word level, and emphasizes the entire speech information contained in the Mandarin audio 921; it consists of 2 convolutional layers, 12 Conformer modules, and 1 linear layer.
Illustratively, the time-frame input feature set 922 corresponding to all time frames in the Mandarin audio 921 is input into the audio encoder 920, and the time-frame input feature corresponding to each time frame is mapped, through the 2 convolutional layers, 12 Conformer modules, and 1 linear layer, to a word-level time-frame posterior probability map, where the abscissa of the time-frame posterior probability map indicates the time line of each time frame, the ordinate indicates the word category, each coordinate point in the map indicates the posterior probability of that word occurring at the given time point, and the darker the color at a coordinate point, the greater the probability; the set of vectors of the 512-dimensional hidden layer at the penultimate layer of the Conformer-structure-based speech recognition model is output as the audio prosody feature vector 923 of the Mandarin audio 921.
It is noted that the Conformer-structure-based speech recognition model is an audio encoder pre-trained on the 10k-hour WenetSpeech data set. The pre-training process of the Conformer-structure-based speech recognition model is described as follows:
The 10k hours of speech in the WenetSpeech data set are input into the Conformer-structure-based speech recognition model to obtain prediction data; the CTC loss is obtained based on the prediction data and pre-acquired ground-truth data; and the Conformer-structure-based speech recognition model is trained based on the CTC loss.
S4: the audio prosodic feature vectors 923 are input into the multimodal fusion decoder.
Since the audio feature vectors at the frame level are much longer than the text feature vectors at the word level, a cross-attention structure is used to solve this problem.
The multi-modal fusion decoder includes a first network layer 931, a first linear layer 932, a second network layer 933, and a second linear layer 934, where the first network layer 931 includes 6 identical network layers, each of which includes a multi-head self-attention layer and a forward propagation layer; only one such network layer is shown in fig. 9 for illustration.
The audio prosody feature vector 923 is input into the 6 identical stacked network layers of the first network layer 931, and dimension conversion is performed through the multi-head self-attention layers, the forward propagation layers, and the first linear layer 932 to obtain an audio prosody feature vector 935 with the same dimension as the text prosody feature vector 913.
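A hedged sketch of this dimension conversion, assuming a 512-dimensional audio feature and a 768-dimensional text feature: the frame-level audio prosody feature vector passes through stacked self-attention/feed-forward layers and is then projected by the first linear layer to the text feature dimension.

```python
import torch
import torch.nn as nn

audio_dim, text_dim, num_layers = 512, 768, 6    # illustrative sizes

self_attn_stack = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=audio_dim, nhead=8, batch_first=True),
    num_layers=num_layers,
)
first_linear = nn.Linear(audio_dim, text_dim)     # stand-in for the first linear layer

audio_feats = torch.randn(1, 600, audio_dim)      # (batch, frames, dim) frame-level audio features
converted = first_linear(self_attn_stack(audio_feats))
print(converted.shape)                            # torch.Size([1, 600, 768])
```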
S5: the audio prosody feature vector 935 and the text prosody feature vector 913, which have the same dimensions as the text prosody feature vector 913, are input to the second network layer 933.
The second network layer 933 includes 6 identical network layers, each of which includes a multi-head cross attention layer and a forward propagation layer, and only one network layer is shown in fig. 9 for illustration.
The audio prosody feature vector 935 (which has the same dimension as the text prosody feature vector 913) and the text prosody feature vector 913 are input into the 6 identical network layers of the second network layer 933 and fused in the multi-head cross attention layer, where the text prosody feature vector 913 serves as the query and the audio prosody feature vector 935 serves as the key and value. The specific calculation formulas are as follows:
the formula I is as follows: q x ,K o ,V o =W Q X,W k O,W v O
The formula II is as follows:
Figure BDA0003652285480000301
in the formula I, O is ═ O 1 ,…,o T ]∈R T×D And K ═ x 1 ,…,x T ]∈R N×D Respectively represent a pluralityAudio and text side input of the cross-head attention layer; q x ,K o ,V o Respectively substituting query input of a text end, key input of an audio end and value input of the audio end; w Q ,W k ,W v Each represents Q x ,K o ,V o A trainable matrix of (a).
In the formula II, H is belonged to R N×D Is the output of the multi-head cross attention layer, namely the fusion prosodic feature vector; d represents the dimension of the fused prosodic feature vector; softmax is the activation function.
Notably, this multi-head cross attention allows the multi-modal fusion decoder to automatically learn the alignment between the dimension-converted audio prosody feature vector 935 and the text prosody feature vector 913.
S6: h epsilon R obtained by multi-head cross attention layer output N×D As a fused prosody feature vector, inputting the fused prosody feature vector into the second linear layer 934, and obtaining a prosody boundary labeling sequence 936: "CC LW CC PW LW CC LW CC CC PPH CC LW CC IPH".
Illustratively, each identifier in the prosody boundary labeling sequence 936 corresponds to one Chinese character in the Chinese text 911 "we propose to label prosody with an automatic labeler" and indicates the type of prosodic boundary on the right side of that character.
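A minimal sketch of S6: the fused prosody feature vectors H ∈ R^(N×D) pass through a linear layer, and an argmax over the five boundary levels yields one boundary identifier per character. The dimensions are illustrative, and with untrained weights the printed labels are arbitrary.

```python
import torch
import torch.nn as nn

BOUNDARY_TYPES = ["CC", "LW", "PW", "PPH", "IPH"]   # five boundary levels named in the text

linear = nn.Linear(768, len(BOUNDARY_TYPES))        # stand-in for the second linear layer 934
fused = torch.randn(14, 768)                        # fused prosody feature vectors, one per character
label_ids = linear(fused).argmax(dim=-1).tolist()   # most probable boundary type per character
print(" ".join(BOUNDARY_TYPES[i] for i in label_ids))
```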
The following describes the automatic index evaluation and manual evaluation results of the prosodic boundary labeling method provided in the embodiment of the present application, and specifically introduces the following:
first, the evaluated data set is presented. The data set was 12.2k utterances (about 160 hours) with audio recorded from a total of 28 different speakers, 95% of which were used as the training set and 5% as the verification set. 5.9k utterances (about 8.8 hours) constitute a test set, the audio of which was recorded by 9 other speakers, these 9 persons not coinciding with the 28 persons mentioned previously.
1. And (6) automatic index evaluation.
Referring to fig. 10, evaluation index data of the first to fifth methods are shown.
It should be noted that, the first method, the third method, the fourth method and the fifth method belong to automatic prosodic boundary labeling models and are obtained by training through the training set; and the second method is manual marking by seven marking personnel. The data in fig. 10 are prosodic boundary labeling result scores on the test set by method one, method two, method three, method four, and method five.
The first method is a prosody boundary labeling method based on text input; the third method is the prosody boundary labeling method provided in the embodiment of the present application with the audio encoder implemented as a CNN-structure-based speech recognition model; the fourth method is the prosody boundary labeling method provided in the embodiment of the present application with the audio encoder implemented as a Conformer-structure-based speech recognition model; and the fifth method is the prosody boundary labeling method provided in the embodiment of the present application with the audio encoder implemented as a Conformer-structure-based speech posterior probability map extractor.
Wherein, the first condition represents whether the audio encoder is pre-trained, the second condition represents whether the audio encoder freezes the parameters in the training process of the automatic prosody boundary labeling model, and the "-" represents that the audio encoder is not included.
Where "pre.", "rec.", "F1" represent accuracy, recall, and balance F scores on the test set, respectively.
The prosodic boundaries in the Chinese text are divided into five levels, which from low to high are characters (CC), grammatical words (LW), Prosodic Words (PW), Prosodic Phrases (PPH), and Intonation Phrases (IPH); the prediction result scores for the latter four prosodic boundary levels are shown in FIG. 10. As can be seen from the results in the figure, the F1 scores for the levels "LW" and "IPH" are substantially above 0.9, i.e., their prediction is relatively simple. Therefore, the analysis in practice focuses mainly on the levels "PW" and "PPH", leading to the following conclusions:
(1) the addition of the audio modal information can enable the prosodic boundary labeling method to be more accurate.
(2) Pre-training can improve the performance of large models. The Conformer-based methods have more parameters and perform better after pre-training; when the model scale is large, fine-tuning during training can cause overfitting and reduce performance.
(3) The Conformer-structure-based methods perform better than the CNN-structure-based method, which is not only due to the Conformer's larger model scale, but also benefits from the Conformer's ability to model contextual semantics and prosody.
(4) The prosodic boundary labeling method provided by the embodiment of the application is better than manual labeling in performance.
Further, referring to fig. 11, which shows the consistency check coefficient matrix 1100 of the seven annotators in the second method, it can be seen that different annotators do not understand prosodic word and prosodic phrase boundaries in a completely consistent way. As shown in fig. 11, for prosodic words, the consistency check coefficients between different annotators are significantly lower than 0.6. As a result, because the criteria for prosodic words and prosodic phrases are inconsistent, prosodic boundary labels produced by different annotators for different batches cannot be used together directly. The prosodic boundary labeling method provided in the embodiment of the present application can use a unified standard to distinguish prosody of different granularities, thereby obtaining higher labeling consistency.
2. Manual evaluation
And further evaluating the performance of the prosody boundary labeling method provided by the embodiment of the application through two manual evaluations.
(1) Control test
300 utterances are randomly picked from the test set, chosen such that their predictions by method five (where the audio encoder is pre-trained and the model corresponding to this method freezes the pre-trained audio encoder parameters during training) differ from the manual labels in the original data set. Three annotators compare which of the two annotations is more consistent with the audio. To eliminate bias, the data automatically labeled by method five and the manually labeled data are presented in random order.
The results showed that for 51% (153) of the utterances, the data automatically annotated by method five received more votes than the manually annotated data. This shows that the prosodic boundary labeling method provided in the embodiment of the present application has an accuracy comparable to that of manual labeling.
(2) Speech synthesis system evaluation
One application scenario of the prosody boundary labeling method provided by the embodiment of the application provides training data for a speech synthesis system, so that the cost of labeling the training data of the speech synthesis system is reduced. Referring to fig. 12, the server obtains a text 1201 and an audio 1202, and inputs the text 1201 and the audio 1202 into a prosody boundary labeling model 1203 corresponding to the prosody boundary labeling method provided in the embodiment of the present application, so as to output training data 1204 of the speech synthesis system labeled with prosody boundaries.
Therefore, whether the prosodic boundary labeling method provided by the embodiment of the application is enough to replace manual labeling needs to be explored in training of a speech synthesis system.
The speech synthesis system is trained separately with data from the prosodic boundary labeling method provided in the embodiment of the present application (labeling method one), manual labeling (labeling method two), and no prosody labeling (labeling method three). The same text and prosody are input to these systems, and a Mean Opinion Score (MOS) test is performed on the generated speech. The generated speech is scored by 24 annotators according to the naturalness of the synthesized speech, with a minimum score of 1 and a maximum score of 5. The MOS test results, with 95% confidence intervals, for speech produced by the speech synthesis system trained with the different prosody labels are shown in fig. 13. It can be seen from the results that training with data carrying any kind of prosodic boundary labels significantly improves the naturalness of the speech synthesis system. The speech synthesis system trained with data labeled by the prosody boundary labeling method provided in the embodiment of the present application scores better than the one trained with manually labeled prosody. This is consistent with the results of the control test, and indicates that the inconsistency of manually labeled prosody confuses the speech synthesis system and makes prosody difficult to model in the speech synthesis system.
Referring to fig. 14, a block diagram of a prosodic boundary labeling apparatus according to an exemplary embodiment of the present application is shown, where the apparatus includes the following modules:
the data acquisition module 1400 is configured to acquire a target text and a target audio, where a text content of the target text is matched with an audio content of the target audio, and the target text is a text to be subjected to prosodic boundary identification;
the feature extraction module 1410 is configured to extract text prosody feature representations of the target text by using characters as analysis granularity; and extracting an audio prosodic feature representation of the target audio on the basis of analysis of the utterance content;
a feature fusion module 1420, configured to fuse the text prosody feature representation and the audio prosody feature representation to obtain a fusion prosody feature representation;
the feature analysis module 1430 is configured to perform prosody boundary prediction on the target text based on the fused prosody feature representation, to obtain a prosody boundary labeling result with a length equal to that of the target text, where the prosody boundary labeling result includes prosody boundaries divided on the target text by using characters as granularity.
In some alternative embodiments, the prosodic boundaries comprise at least one of character boundaries, grammar word boundaries, prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries;
the character boundary is a boundary dividing characters in the target text;
the grammar word boundary is a boundary for dividing grammar words in the target text;
the prosodic word boundaries are boundaries for dividing prosodic words in the target text;
the prosodic phrase boundary is a boundary dividing prosodic phrases in the target text;
the intonation phrase boundaries are boundaries for dividing the intonation phrases in the target text.
Referring to fig. 15, in some alternative embodiments, the feature analysis module 1430 includes:
a determining sub-module 1431, configured to perform prosodic boundary prediction on the target text based on the fused prosodic feature representation, and determine a prosodic boundary type corresponding to a character in the target text;
and a labeling sub-module 1432, configured to label the characters in the target text with the prosodic boundary type, so as to obtain a prosodic boundary labeling result with a length equal to that of the target text.
In some optional embodiments, the feature extraction module 1410 includes:
the segmentation submodule 1411 is configured to perform character segmentation on the target text to obtain a plurality of character data in the target text;
an extracting sub-module 1412, configured to extract word vectors corresponding to the plurality of character data respectively;
the first processing sub-module 1413 is configured to input the word vector into a text encoder, and output a text prosody feature representation of the target text, where the text encoder is an encoder obtained through pre-training of a text corpus.
The extraction sub-module 1412, further configured to extract a frequency-domain feature representation and a pitch feature representation of the target audio, where the frequency-domain feature representation and the pitch feature representation are used to indicate the content of the utterance of the target audio; in some optional embodiments, the feature extraction module 1410 further includes:
the splicing submodule 1414 is used for splicing the frequency domain feature representation and the pitch feature representation to obtain a target feature representation;
the first processing sub-module 1413 is further configured to input the target feature representation into an audio encoder, and output the target feature representation to obtain an audio prosody feature representation of the target audio, where the audio encoder is an encoder obtained through pre-training of a speech data set.
In some optional embodiments, the first processing sub-module 1413 further includes:
an input unit 1415, configured to input the target feature representation into the audio encoder, so as to obtain a first speech posterior probability map, where the first speech posterior probability map is used to indicate a phoneme-level posterior probability of the target audio;
an output unit 1416, configured to output an audio prosody feature representation of the target audio based on the first speech posterior probability map.
The input unit 1415 is further configured to input the target feature representation into the audio encoder, so as to obtain a second speech posterior probability map, where the second speech posterior probability map is used to indicate a word-level posterior probability of the target audio; the output unit 1416 is further configured to output an audio prosody feature representation of the target audio based on the second speech posterior probability map.
In some optional embodiments, the target audio is divided into a plurality of audio segments to be analyzed, and the target feature representation includes segment feature representations corresponding to the target audio segments; the input unit 1415 is further configured to input the segment feature representation into the audio encoder, so as to obtain a posterior probability sub-graph corresponding to the segment feature representation; the output unit 1416 is further configured to integrate the posterior probability subgraphs corresponding to the multiple audio segments, respectively, to obtain the second speech posterior probability graph.
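As a rough sketch of this segment-wise handling: the target feature representation is cut into fixed-length audio segments, each segment is run through the encoder to produce a posterior probability sub-graph, and the sub-graphs are integrated, here simply by concatenation along the time axis, into the second speech posterior probability graph. The segment length, the softmax stand-in encoder, and concatenation as the integration step are all assumptions.

```python
import numpy as np

def encoder_posterior(segment: np.ndarray, num_classes: int = 10) -> np.ndarray:
    """Stand-in audio encoder: map segment frames to a posterior probability sub-graph."""
    rng = np.random.default_rng(0)
    logits = segment @ rng.normal(size=(segment.shape[1], num_classes))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)               # (frames, num_classes), rows sum to 1

def second_posterior_graph(target_feats: np.ndarray, seg_frames: int = 100) -> np.ndarray:
    """Split into audio segments, encode each, then integrate the sub-graphs."""
    segments = [target_feats[i:i + seg_frames]
                for i in range(0, len(target_feats), seg_frames)]
    subgraphs = [encoder_posterior(seg) for seg in segments]
    return np.concatenate(subgraphs, axis=0)              # spans all frames of the target audio

demo_feats = np.random.default_rng(1).normal(size=(250, 81))   # e.g. spliced mel + pitch frames
print(second_posterior_graph(demo_feats).shape)                # (250, 10)
```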
In some optional embodiments, the feature fusion module 1420 includes:
a conversion sub-module 1421, configured to perform dimension conversion on the audio prosody feature representation by taking the dimension of the text prosody feature representation as a target; the feature fusion module 1420 is further configured to fuse the audio prosody feature representation after the dimension conversion with the text prosody feature representation to obtain the fused prosody feature representation.
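A minimal sketch of the conversion and fusion steps, under two assumptions that the embodiments do not fix here: the audio frames are pooled down to one vector per character so the two sequences align, and fusion is element-wise addition after a linear projection to the text feature dimension.

```python
import numpy as np

def fuse_prosody_features(text_feats: np.ndarray, audio_feats: np.ndarray) -> np.ndarray:
    """Dimension-convert the audio features to the text feature dimension, then fuse."""
    num_chars, text_dim = text_feats.shape

    # Assumption: average-pool audio frames into one vector per character.
    pooled = np.stack([chunk.mean(axis=0)
                       for chunk in np.array_split(audio_feats, num_chars)])

    # Dimension conversion: project audio features to the text feature dimension.
    rng = np.random.default_rng(0)
    P = rng.normal(size=(audio_feats.shape[1], text_dim)) / np.sqrt(audio_feats.shape[1])
    audio_converted = pooled @ P                 # (num_chars, text_dim)

    return text_feats + audio_converted          # assumed fusion operator: element-wise addition

text_feats = np.zeros((6, 32))       # e.g. 6 characters, 32-dim text prosody features
audio_feats = np.ones((250, 81))     # e.g. 250 frames of spliced frequency-domain + pitch features
print(fuse_prosody_features(text_feats, audio_feats).shape)   # (6, 32)
```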
In some optional embodiments, the data obtaining module 1400 is further configured to obtain a preset phrase library, where the preset phrase library includes a plurality of preset phrases; the device further comprises:
the data matching module 1440 is configured to match the target text with a preset phrase in the preset phrase library to obtain a reference matching result; the feature analysis module 1430 further includes:
the prediction submodule 1433 is configured to perform prosodic boundary prediction on the target text based on the fused prosodic feature representation to obtain a prediction result;
the second processing sub-module 1434 is configured to obtain the prosodic boundary labeling result with the same length as the target text based on the prediction result and the reference matching result.
In some optional embodiments, the second processing sub-module 1434 is further configured to adjust the prediction result based on the reference matching result, so as to obtain the prosodic boundary labeling result with a length equal to that of the target text.
In some optional embodiments, the reference matching result includes a prosody boundary reference annotation sequence, and the prediction result includes a prosody boundary prediction annotation sequence; the second processing sub-module 1434, further comprising:
an alignment unit 1435, configured to compare the prosody boundary reference tagging sequence with the prosody boundary prediction tagging sequence to obtain a prosody boundary comparison result, where the prosody boundary comparison result is used to indicate a difference between the reference matching result and the prediction result; the second processing sub-module 1434 is further configured to determine, based on the prosodic boundary comparison result, the prosodic boundary labeling result that is as long as the target text.
In some optional embodiments, the preset phrases in the preset phrase library are correspondingly weighted, and the prosodic boundary reference labeling sequence includes a first prosodic boundary identifier; the second processing sub-module 1434, further comprising:
an obtaining unit 1436, configured to obtain, in response to the first prosodic boundary identifier not existing in the prosodic boundary prediction tagging sequence, a phrase indicated by the first prosodic boundary identifier in the target text;
the obtaining unit 1436 is further configured to match the phrase with the preset phrase library, and obtain a weight matched with the phrase;
an updating unit 1437, configured to update the prosody boundary prediction labeling sequence to obtain the prosody boundary labeling result, in response to the weight matched with the phrase reaching a preset weight threshold.
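The adjustment performed by these units can be sketched as a simple rule: when the reference annotation sequence derived from the preset phrase library marks a boundary that the predicted sequence misses, and the phrase matched at that position carries a weight at or above the threshold, the predicted sequence is updated to include the boundary. The 0/1 sequence encoding, the example phrases, and the threshold value are illustrative assumptions.

```python
def adjust_prediction(pred, ref, phrase_at_boundary, phrase_weights, threshold=0.8):
    """Insert reference boundaries missed by the prediction when the phrase weight is high enough.

    pred, ref: per-character lists; 1 = prosodic boundary after this character, 0 = none.
    phrase_at_boundary: {position: phrase} mapping a reference boundary to the phrase ending there.
    phrase_weights: {phrase: weight} from the preset phrase library.
    """
    result = list(pred)
    for i, (p, r) in enumerate(zip(pred, ref)):
        if r == 1 and p == 0:                    # boundary in the reference, missing in the prediction
            phrase = phrase_at_boundary.get(i)
            if phrase is not None and phrase_weights.get(phrase, 0.0) >= threshold:
                result[i] = 1                    # update the prosody boundary prediction labeling sequence
    return result

pred = [0, 0, 0, 1, 0, 1]
ref  = [0, 1, 0, 1, 0, 1]
print(adjust_prediction(pred, ref, {1: "今天"}, {"今天": 0.9}))   # -> [0, 1, 0, 1, 0, 1]
```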
In some optional embodiments, the data obtaining module 1400 is further configured to obtain a candidate text and a preset sentence library, where the candidate text includes a plurality of target sentences, and the preset sentence library includes a plurality of preset sentences; the data obtaining module 1400 further includes:
a matching sub-module 1401, configured to match the multiple target sentences with preset sentences in the preset sentence library, respectively, to obtain multiple sentence matching results;
the screening submodule 1402 is configured to screen the multiple sentence matching results to obtain, from the candidate text, a target sentence that fails to match any preset sentence in the preset sentence library, use the target sentence as the target text, and acquire the audio matched with the target text as the target audio.
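The data acquisition flow can be pictured as a filter over the candidate text: sentences that already match a preset sentence in the library are screened out, and only the unmatched sentences are kept as the target text whose matching audio is then fetched. Exact string matching and the sample data below are assumptions for illustration.

```python
def select_target_sentences(candidate_sentences, preset_sentence_library):
    """Keep only candidate sentences that fail to match any preset sentence."""
    preset = set(preset_sentence_library)
    return [s for s in candidate_sentences if s not in preset]

candidates = ["今天天气真好", "我们去公园散步", "明天会下雨"]
library = {"今天天气真好"}
print(select_target_sentences(candidates, library))   # sentences to be used as the target text
```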
In summary, the prosodic boundary labeling apparatus provided in the embodiments of the present application fuses the text prosody feature representation of the target text with the audio prosody feature representation of the target audio to obtain a fused prosody feature representation, and performs prosodic boundary prediction on the fused prosody feature representation; because the audio prosody feature representation carries prosodic boundary information, the accuracy of prosodic boundary labeling on the target text is improved. In addition, the target text is analyzed and prosodic boundaries are predicted at character granularity, which improves the fine granularity of the prosodic boundary labeling result and further improves the accuracy of prosodic boundary labeling on the target text.
It should be noted that: the prosodic boundary labeling apparatus provided in the foregoing embodiment is illustrated only by the division of the functional modules above; in practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the prosodic boundary labeling apparatus provided in the foregoing embodiment and the prosodic boundary labeling method embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not described herein again.
Fig. 16 shows a schematic structural diagram of a server provided in an exemplary embodiment of the present application. The server may be the server shown in fig. 2. Specifically, the server includes the following structure:
the server 1600 includes a Central Processing Unit (CPU) 1601, a system Memory 1604 including a Random Access Memory (RAM) 1602 and a Read Only Memory (ROM) 1603, and a system bus 1605 connecting the system Memory 1604 and the Central Processing Unit 1601. The server 1600 also includes a mass storage device 1606 for storing an operating system 1613, application programs 1614, and other program modules 1615.
The mass storage device 1606 is connected to the central processing unit 1601 by a mass storage controller (not shown) connected to the system bus 1605. The mass storage device 1606 and its associated computer-readable media provide non-volatile storage for the server 1600. That is, the mass storage device 1606 may include a computer-readable medium (not shown) such as a hard disk or Compact disk Read Only Memory (CD-ROM) drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash Memory or other solid state Memory technology, CD-ROM, Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1604 and mass storage device 1606 described above may collectively be referred to as memory.
According to various embodiments of the application, the server 1600 may also operate with remote computers connected to a network, such as the Internet. That is, the server 1600 may be connected to the network 1612 through the network interface unit 1611 that is coupled to the system bus 1605, or the network interface unit 1611 may be used to connect to other types of networks or remote computer systems (not shown).
The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU.
Embodiments of the present application also provide a computer device, which may be implemented as a terminal or a server as shown in fig. 3. The computer device comprises a processor and a memory, wherein at least one instruction, at least one program, a code set or an instruction set is stored in the memory, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by the processor to realize the prosodic boundary labeling method provided by the method embodiments.
Embodiments of the present application further provide a computer-readable storage medium, where at least one instruction, at least one program, a code set, or an instruction set is stored on the computer-readable storage medium, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the prosody boundary labeling method provided by the foregoing method embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to make the computer device execute the prosody boundary labeling method provided by the method embodiments.
Optionally, the computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The Random Access Memory may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM). The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (18)

1. A prosodic boundary labeling method, the method comprising:
acquiring a target text and a target audio, wherein the text content of the target text is matched with the audio content of the target audio, and the target text is a text to be subjected to prosodic boundary identification;
extracting text prosody feature representation of the target text by taking characters as analysis granularity; extracting audio prosody feature representation of the target audio on the basis of analysis of the sound production content;
fusing the text prosody feature representation and the audio prosody feature representation to obtain a fused prosody feature representation;
performing prosody boundary prediction on the target text based on the fused prosody feature representation to obtain a prosody boundary labeling result with the same length as the target text, wherein the prosody boundary labeling result comprises a prosody boundary which is divided on the target text by taking characters as granularity.
2. The method of claim 1, wherein the prosodic boundaries comprise at least one of word boundaries, grammar word boundaries, prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries;
the word boundary is a boundary dividing characters in the target text;
the grammar word boundary is a boundary for dividing grammar words in the target text;
the prosodic word boundary is a boundary for dividing prosodic words in the target text;
the prosodic phrase boundary is a boundary dividing prosodic phrases in the target text;
the intonation phrase boundaries are boundaries for dividing the intonation phrases in the target text.
3. The method of claim 2, wherein performing prosodic boundary prediction on the target text based on the fused prosodic feature representation to obtain a prosodic boundary labeling result with the same length as the target text comprises:
performing prosodic boundary prediction on the target text based on the fused prosodic feature representation, and determining prosodic boundary types corresponding to characters in the target text;
and marking characters in the target text by the prosodic boundary type to obtain a prosodic boundary marking result with the same length as the target text.
4. The method according to any one of claims 1 to 3, wherein the extracting text prosody features of the target text with the character as analysis granularity comprises:
performing character segmentation on the target text to obtain a plurality of character data in the target text;
extracting word vectors corresponding to a plurality of character data respectively;
and inputting the word vector into a text encoder, and outputting text prosody feature representation of the target text, wherein the text encoder is obtained by pre-training a text corpus.
5. The method according to any one of claims 1 to 3, wherein the extracting the audio prosodic feature representation of the target audio based on the analysis of the utterance content comprises:
extracting a frequency domain feature representation and a pitch feature representation of the target audio, the frequency domain feature representation and the pitch feature representation being indicative of the content of an utterance of the target audio;
splicing the frequency domain feature representation and the pitch feature representation to obtain a target feature representation;
and inputting the target feature representation into an audio encoder, and outputting to obtain the audio prosody feature representation of the target audio, wherein the audio encoder is obtained by pre-training a voice data set.
6. The method of claim 5, wherein the inputting the target feature representation into an audio encoder and outputting to obtain the audio prosody feature representation of the target audio comprises:
inputting the target feature representation into the audio encoder to obtain a first speech posterior probability graph, wherein the first speech posterior probability graph is used for indicating phoneme level posterior probability of the target audio;
and outputting and obtaining the audio prosody feature representation of the target audio based on the first voice posterior probability graph.
7. The method of claim 5, wherein the inputting the target feature representation into an audio encoder and outputting to obtain the audio prosody feature representation of the target audio comprises:
inputting the target feature representation into the audio encoder to obtain a second voice posterior probability graph, wherein the second voice posterior probability graph is used for indicating the word-level posterior probability of the target audio;
and outputting and obtaining the audio prosody feature representation of the target audio based on the second voice posterior probability graph.
8. The method according to claim 7, wherein the target audio is divided into a plurality of audio segments for analysis, and the target feature representation includes segment feature representations corresponding to the target audio segments;
inputting the target feature representation into the audio encoder to obtain a second speech posterior probability map, including:
inputting the segment feature representation into the audio encoder to obtain a posterior probability subgraph corresponding to the segment feature representation;
and integrating the posterior probability subgraphs corresponding to the plurality of audio clips respectively to obtain the second voice posterior probability graph.
9. The method according to any one of claims 1 to 3, wherein the fusing the text prosody feature representation and the audio prosody feature representation to obtain a fused prosody feature representation comprises:
taking the dimension of the text prosody feature representation as a target, and performing dimension conversion on the audio prosody feature representation;
and fusing the audio prosody feature representation after the dimension conversion with the text prosody feature representation to obtain the fused prosody feature representation.
10. The method of any of claims 1 to 3, further comprising:
acquiring a preset phrase library, wherein the preset phrase library comprises a plurality of preset phrases;
matching the target text with a preset phrase in the preset phrase library to obtain a reference matching result;
performing prosody boundary prediction on the target text based on the fused prosody feature representation to obtain a prosody boundary labeling result with the same length as the target text, including:
performing prosodic boundary prediction on the target text based on the fused prosodic feature representation to obtain a prediction result;
and obtaining the prosodic boundary labeling result with the same length as the target text based on the prediction result and the reference matching result.
11. The method of claim 10, wherein obtaining the prosodic boundary labeling result with a length equal to the target text based on the prediction result and the reference matching result comprises:
and adjusting the prediction result based on the reference matching result to obtain the prosodic boundary labeling result with the same length as the target text.
12. The method of claim 11, wherein the reference matching result comprises a prosodic boundary reference tag sequence, and the prediction result comprises a prosodic boundary prediction tag sequence;
the adjusting the prediction result based on the reference matching result to obtain the prosodic boundary labeling result with the same length as the target text, includes:
comparing the prosody boundary reference marking sequence with the prosody boundary prediction marking sequence to obtain a prosody boundary comparison result, wherein the prosody boundary comparison result is used for indicating the difference between the reference matching result and the prediction result;
and determining the prosodic boundary labeling result with the same length as the target text based on the prosodic boundary comparison result.
13. The method according to claim 12, wherein the preset phrases in the preset phrase library are associated with weights, and the prosodic boundary reference labeling sequence includes a first prosodic boundary identifier;
the obtaining of the prosodic boundary labeling result as long as the target text based on the prosodic boundary comparison result includes:
in response to the first prosodic boundary identifier not existing in the prosodic boundary prediction annotation sequence, obtaining a phrase indicated by the first prosodic boundary identifier in the target text;
matching the phrases with the preset phrase library to obtain the weight matched with the phrases;
and updating the prosodic boundary prediction labeling sequence to obtain a prosodic boundary labeling result in response to the weight matched with the phrase reaching a preset weight threshold.
14. The method of any one of claims 1 to 3, wherein the obtaining the target text and the target audio comprises:
acquiring a candidate text and a preset sentence library, wherein the candidate text comprises a plurality of target sentences, and the preset sentence library comprises a plurality of preset sentences;
matching the target sentences with preset sentences in the preset sentence library respectively to obtain a plurality of sentence matching results;
screening the plurality of sentence matching results to obtain, from the candidate text, a target sentence that fails to match any preset sentence in the preset sentence library, and using the target sentence as the target text; and acquiring the audio matched with the target text as the target audio.
15. A prosodic boundary labeling apparatus, the apparatus comprising:
the data acquisition module is used for acquiring a target text and a target audio, wherein the text content of the target text is matched with the audio content of the target audio, and the target text is a text to be subjected to prosodic boundary identification;
the feature extraction module is used for extracting text prosody feature representation of the target text by taking characters as analysis granularity; and extracting an audio prosody feature representation of the target audio on the basis of the analysis of the content of the utterance;
the feature fusion module is used for fusing the text prosody feature representation and the audio prosody feature representation to obtain a fused prosody feature representation;
and the feature analysis module is used for carrying out prosody boundary prediction on the target text based on the fused prosody feature representation to obtain a prosody boundary labeling result with the same length as the target text, wherein the prosody boundary labeling result comprises a prosody boundary which is divided on the target text by taking characters as granularity.
16. A computer device comprising a processor and a memory, wherein the memory stores at least one program which is loaded and executed by the processor to implement the prosodic boundary labeling method of any one of claims 1 to 14.
17. A computer-readable storage medium, wherein at least one program is stored in the storage medium, and the at least one program is loaded and executed by a processor to implement the prosodic boundary labeling method of any one of claims 1 to 14.
18. A computer program product comprising computer instructions which, when executed by a processor, implement a prosodic boundary labeling method as claimed in any one of claims 1 to 14.
CN202210555616.XA 2022-05-19 2022-05-19 Prosodic boundary labeling method, device, equipment, medium and program product Active CN115116428B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210555616.XA CN115116428B (en) 2022-05-19 2022-05-19 Prosodic boundary labeling method, device, equipment, medium and program product

Publications (2)

Publication Number Publication Date
CN115116428A true CN115116428A (en) 2022-09-27
CN115116428B CN115116428B (en) 2024-03-15

Family

ID=83326007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210555616.XA Active CN115116428B (en) 2022-05-19 2022-05-19 Prosodic boundary labeling method, device, equipment, medium and program product

Country Status (1)

Country Link
CN (1) CN115116428B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120271632A1 (en) * 2011-04-25 2012-10-25 Microsoft Corporation Speaker Identification
CN108470024A (en) * 2018-03-12 2018-08-31 北京灵伴即时智能科技有限公司 A kind of Chinese rhythm structure prediction technique of fusion syntactic-semantic pragmatic information
CN110444191A (en) * 2019-01-22 2019-11-12 清华大学深圳研究生院 A kind of method, the method and device of model training of prosody hierarchy mark
CN110534087A (en) * 2019-09-04 2019-12-03 清华大学深圳研究生院 A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium
CN112802450A (en) * 2021-01-05 2021-05-14 杭州一知智能科技有限公司 Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof
CN113920977A (en) * 2021-09-30 2022-01-11 宿迁硅基智能科技有限公司 Speech synthesis model, model training method and speech synthesis method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116092479A (en) * 2023-04-07 2023-05-09 杭州东上智能科技有限公司 Text prosody generation method and system based on comparison text-audio pair
CN116978354A (en) * 2023-08-01 2023-10-31 支付宝(杭州)信息技术有限公司 Training method and device of prosody prediction model, and voice synthesis method and device
CN116978354B (en) * 2023-08-01 2024-04-30 支付宝(杭州)信息技术有限公司 Training method and device of prosody prediction model, and voice synthesis method and device

Also Published As

Publication number Publication date
CN115116428B (en) 2024-03-15

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN102360543B (en) HMM-based bilingual (mandarin-english) TTS techniques
Eyben et al. Unsupervised clustering of emotion and voice styles for expressive TTS
Kumar et al. A large-vocabulary continuous speech recognition system for Hindi
US20180137109A1 (en) Methodology for automatic multilingual speech recognition
Watts Unsupervised learning for text-to-speech synthesis
Patil et al. A syllable-based framework for unit selection synthesis in 13 Indian languages
CN115116428B (en) Prosodic boundary labeling method, device, equipment, medium and program product
CN110782880B (en) Training method and device for prosody generation model
CN109979257B (en) Method for performing accurate splitting operation correction based on English reading automatic scoring
Sangeetha et al. Speech translation system for english to dravidian languages
El Ouahabi et al. Toward an automatic speech recognition system for amazigh-tarifit language
CN113450757A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
Fordyce et al. Prosody prediction for speech synthesis using transformational rule-based learning.
Do et al. Text-to-speech for under-resourced languages: Phoneme mapping and source language selection in transfer learning
Ronzhin et al. Survey of russian speech recognition systems
Raghavendra et al. A multilingual screen reader in Indian languages
Ribeiro et al. Learning word vector representations based on acoustic counts
Domokos et al. Romanian phonetic transcription dictionary for speeding up language technology development
Demeke et al. Duration modeling of phonemes for amharic text to speech system
Carson-Berndsen Multilingual time maps: portable phonotactic models for speech technology
Praveen et al. Phoneme based Kannada Speech Corpus for Automatic Speech Recognition System
Houidhek et al. Evaluation of speech unit modelling for HMM-based speech synthesis for Arabic
Imam et al. The Computation of Assimilation of Arabic Language Phonemes
Hendessi et al. A speech synthesizer for Persian text using a neural network with a smooth ergodic HMM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant