CN116168684A - Method and apparatus for training a text prosody prediction model and predicting text prosody


Info

Publication number
CN116168684A
Authority
CN
China
Prior art keywords
text, prosody, voice, sample pairs, prosodic
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310196588.1A
Other languages
Chinese (zh)
Inventor
刘凯
杜新凯
蔡岩松
唐延欢
王天霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sunshine Insurance Group Co Ltd
Original Assignee
Sunshine Insurance Group Co Ltd
Application filed by Sunshine Insurance Group Co Ltd filed Critical Sunshine Insurance Group Co Ltd
Priority to CN202310196588.1A priority Critical patent/CN116168684A/en
Publication of CN116168684A publication Critical patent/CN116168684A/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/027: Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10: Prosody rules derived from text; Stress or intonation
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The present application provides a method and apparatus for training a text prosody prediction model and for predicting text prosody. The method includes: obtaining a plurality of timestamps corresponding to each voice in a plurality of sample pairs; performing prosody annotation on each text in the sample pairs according to the timestamps to obtain a plurality of annotated texts; acquiring the prosody information corresponding to each annotated text; and training a base model with the texts in the sample pairs and the prosody information of the annotated texts to obtain a text prosody prediction model. The text prosody prediction model obtained in this way can accurately predict the prosody of a text, and in turn enables an intelligent robot to generate audio with natural prosody.

Description

Method and apparatus for training a text prosody prediction model and predicting text prosody
Technical Field
The present application relates to the field of controlling audio prosody, and in particular, to a method and apparatus for training a text prosody prediction model and predicting text prosody.
Background
At present, with the continued spread of related technologies such as big data, cloud computing, and the Internet, human-machine intelligence has become an inevitable trend. A conventional intelligent customer service robot is trained on massive amounts of text data, and the prosody of audio is encoded through a deep learning model in order to extract prosodic features.
However, most existing intelligent customer service robots speak unnaturally, their speech prosody is relatively mechanical, and the model training process consumes a large amount of resources.
Therefore, how to make an intelligent robot generate audio with natural prosody is a technical problem to be solved.
Disclosure of Invention
The embodiments of the present application aim to provide a method for training a text prosody prediction model; through the technical solution of these embodiments, an intelligent robot can be made to generate audio with natural prosody.
In a first aspect, an embodiment of the present application provides a method for training a text prosody prediction model, including: obtaining a plurality of timestamps corresponding to each voice in a plurality of sample pairs, where each sample pair includes a voice and the recognition text corresponding to that voice; performing prosody annotation on each text in the sample pairs according to the timestamps to obtain a plurality of annotated texts, where prosody annotation includes placing an annotation at each speech pause; acquiring the prosody information corresponding to each annotated text, where the prosody information includes the interval distances between prosody annotations and the parts of speech of the words adjacent to each annotation; and training a base model with the texts in the sample pairs and the prosody information of the annotated texts to obtain a text prosody prediction model.
In the above embodiment, the samples are annotated by means of timestamps, and the text prosody prediction model can be trained on the resulting annotated samples while taking the prosody information of the text into account. A text prosody prediction model obtained in this way can accurately predict text prosody, so that the intelligent robot can generate audio with natural prosody during subsequent speech synthesis.
In some embodiments, before the plurality of timestamps corresponding to each voice in the plurality of sample pairs are obtained, the method further includes:
combining a plurality of voices stored from human-computer dialogues in different scenarios and the texts corresponding to those voices into a plurality of sample pairs;
training a basic speech recognition model with some of the sample pairs to obtain a speech recognition model;
and inputting the remaining sample pairs into the speech recognition model to obtain a plurality of speech recognition results and a plurality of timestamps.
In the above embodiment, the speech recognition model is obtained by training on part of the sample pairs; the other sample pairs are then input into this model, so that the timestamps corresponding to the voices in those pairs can be obtained, and the sample pairs can then be accurately annotated by means of the timestamps.
In some embodiments, the interval distance includes the interval distance between two adjacent prosody annotations corresponding to the same preset pause time range, and the interval distance between a prosody annotation corresponding to a first preset pause time range and a prosody annotation corresponding to a second preset pause time range;
the parts of speech of the adjacent words are the parts of speech of the words immediately before and after each of the one or more prosody annotations.
In the above embodiment, training the model with such prosody information yields a text prosody prediction model that can accurately predict the prosody information of a text.
In some embodiments, performing prosody annotation on each text in the plurality of sample pairs according to the plurality of timestamps to obtain the plurality of annotated texts includes:
generating prosody annotation symbols corresponding to different preset pause time ranges;
determining the preset pause time range corresponding to each speech pause duration recorded in the plurality of timestamps;
and annotating each text in the sample pairs with the prosody annotation symbol matching the preset pause time range of each speech pause.
In the above embodiment, according to the preset time range in which each speech pause falls, the different prosody symbols of the text can be annotated unambiguously, so that the prosody information for training the text prosody prediction model can be obtained accurately.
In a second aspect, an embodiment of the present application provides a method for predicting text prosody, including: converting the punctuation marks in a text to be predicted into the corresponding prosody annotation symbols to obtain a converted text, where different punctuation marks correspond to different preset pause time ranges, and different preset pause time ranges correspond to different prosody annotation symbols; and inputting the converted text into a preset text prosody prediction model to obtain a text prosody prediction result, where the text prosody prediction result includes the text with prosody annotations.
In this way, the text prosody prediction model can accurately predict text prosody from the different preset pause time ranges corresponding to the different punctuation marks, so that the intelligent robot can generate audio with natural prosody during subsequent speech synthesis.
In some embodiments, after the converted text is input into the preset text prosody prediction model to obtain the text prosody prediction result, the method further includes:
converting the prosody-annotated text in the text prosody prediction result into pinyin;
converting the pinyin into phonemes through a dictionary;
and synthesizing the phonemes into audio with prosody through a speech conversion model.
In the above embodiment, once the prosody prediction result of the text has been accurately obtained, the text can be converted into phonemes, and audio with prosody is finally synthesized from the phonemes; this audio naturally imitates the prosody of human speech.
In some embodiments, converting the punctuation marks in the text to be predicted into the corresponding prosody annotation symbols to obtain the converted text includes:
segmenting the text to be predicted into words to obtain a segmented text;
and separating the words in the segmented text with a preset symbol while converting the punctuation marks into the corresponding prosody annotation symbols to obtain the converted text.
In the above embodiment, through word segmentation and punctuation conversion, the pause durations of the text when it is converted into speech can be distinguished, so that audio with prosody is formed.
In some embodiments, before the punctuation marks in the text to be predicted are converted into the corresponding prosody annotation symbols, the method further includes:
obtaining a plurality of timestamps corresponding to each voice in a plurality of sample pairs, where each sample pair includes a voice and the recognition text corresponding to that voice;
performing prosody annotation on each text in the sample pairs according to the timestamps to obtain a plurality of annotated texts, where prosody annotation includes placing an annotation at each speech pause;
acquiring the prosody information corresponding to each annotated text, where the prosody information includes the interval distances between prosody annotations and the parts of speech of the words adjacent to each annotation;
and training a base model with the texts in the sample pairs and the prosody information of the annotated texts to obtain the text prosody prediction model.
In this way, the samples are annotated by means of timestamps, and the text prosody prediction model can be trained on the resulting annotated samples while taking the prosody information of the text into account; a model obtained in this way can accurately predict text prosody, so that the intelligent robot can generate audio with natural prosody during subsequent speech synthesis.
In a third aspect, an embodiment of the present application provides an apparatus for training a text prosody prediction model, including:
a first acquisition module, configured to obtain a plurality of timestamps corresponding to each voice in a plurality of sample pairs, where each sample pair includes a voice and the recognition text corresponding to that voice;
an annotation module, configured to perform prosody annotation on each text in the sample pairs according to the timestamps to obtain a plurality of annotated texts, where prosody annotation includes placing an annotation at each speech pause;
a second acquisition module, configured to acquire the prosody information corresponding to each annotated text, where the prosody information includes the interval distances between prosody annotations and the parts of speech of the words adjacent to each annotation;
and a training module, configured to train a base model with the texts in the sample pairs and the prosody information of the annotated texts to obtain a text prosody prediction model.
Optionally, the apparatus further includes:
a recognition module, configured to, before the first acquisition module obtains the plurality of timestamps, combine a plurality of voices stored from human-computer dialogues in different scenarios and the texts corresponding to those voices into a plurality of sample pairs;
train a basic speech recognition model with some of the sample pairs to obtain a speech recognition model;
and input the remaining sample pairs into the speech recognition model to obtain a plurality of speech recognition results and a plurality of timestamps.
Optionally, the interval distance includes the interval distance between two adjacent prosody annotations corresponding to the same preset pause time range, and the interval distance between a prosody annotation corresponding to a first preset pause time range and a prosody annotation corresponding to a second preset pause time range;
the parts of speech of the adjacent words are the parts of speech of the words immediately before and after each of the one or more prosody annotations.
Optionally, the annotation module is specifically configured to:
generate prosody annotation symbols corresponding to different preset pause time ranges;
determine the preset pause time range corresponding to each speech pause duration recorded in the plurality of timestamps;
and annotate each text in the sample pairs with the prosody annotation symbol matching the preset pause time range of each speech pause.
In a fourth aspect, an embodiment of the present application provides an apparatus for predicting text prosody, including:
a conversion module, configured to convert the punctuation marks in a text to be predicted into the corresponding prosody annotation symbols to obtain a converted text, where different punctuation marks correspond to different preset pause time ranges, and different preset pause time ranges correspond to different prosody annotation symbols;
and a prediction module, configured to input the converted text into a preset text prosody prediction model to obtain a text prosody prediction result, where the text prosody prediction result includes the text with prosody annotations.
Optionally, the apparatus further includes:
a synthesis module, configured to, after the converted text is input into the preset text prosody prediction model to obtain the text prosody prediction result, convert the prosody-annotated text in the result into pinyin;
convert the pinyin into phonemes through a dictionary;
and synthesize the phonemes into audio with prosody through a speech conversion model.
Optionally, the conversion module is specifically configured to:
segment the text to be predicted into words to obtain a segmented text;
and separate the words in the segmented text with a preset symbol while converting the punctuation marks into the corresponding prosody annotation symbols to obtain the converted text.
Optionally, the apparatus further includes:
a training module, configured to, before the conversion module converts the punctuation marks in the text to be predicted into the corresponding prosody annotation symbols, obtain a plurality of timestamps corresponding to each voice in a plurality of sample pairs, where each sample pair includes a voice and the recognition text corresponding to that voice;
perform prosody annotation on each text in the sample pairs according to the timestamps to obtain a plurality of annotated texts, where prosody annotation includes placing an annotation at each speech pause;
acquire the prosody information corresponding to each annotated text, where the prosody information includes the interval distances between prosody annotations and the parts of speech of the words adjacent to each annotation;
and train a base model with the texts in the sample pairs and the prosody information of the annotated texts to obtain the text prosody prediction model.
In a fifth aspect, embodiments of the present application provide an electronic device comprising a processor and a memory storing computer readable instructions that, when executed by the processor, perform the steps of the method as provided in the first aspect above.
In a sixth aspect, embodiments of the present application provide a readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method as provided in the first aspect above.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the embodiments of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for training a text prosody prediction model provided in an embodiment of the present application;
FIG. 2 is a flowchart of a method for predicting text prosody provided by an embodiment of the present application;
FIG. 3 is a flow chart of a method of synthesizing audio with prosody according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of an apparatus for training a text prosody prediction model provided in an embodiment of the present application;
FIG. 5 is a schematic block diagram of an apparatus for predicting text prosody according to an embodiment of the present application;
FIG. 6 is a schematic block diagram of an apparatus for training a text prosody prediction model according to an embodiment of the present application;
fig. 7 is a schematic block diagram of a device for predicting text prosody according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. The components of the embodiments of the present application, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
Some of the terms referred to in the embodiments of the present application will be described first to facilitate understanding by those skilled in the art.
ASR: automatic speech recognition (Automatic Speech Recognition), a technique that converts human speech into text.
TTS model: a text-to-speech synthesis model that converts text into speech.
Timestamp: a piece of complete, verifiable data proving that certain data already existed at a particular point in time; it mainly serves as electronic proof for a user of when the user's data was generated.
The method is applied to scenarios where audio prosody is controlled; specifically, a text prosody prediction model is trained, text prosody is predicted with that model, and audio with prosody is synthesized from the predicted text prosody.
At present, however, with the continued spread of related technologies such as big data, cloud computing, and the Internet, human-machine intelligence has become an inevitable trend. A conventional intelligent customer service robot is trained on massive amounts of text data, and the prosody of audio is encoded through a deep learning model in order to extract prosodic features. Most existing intelligent customer service robots speak unnaturally, their speech prosody is relatively mechanical, and the model training process consumes a large amount of resources. In addition, it is often very difficult to obtain voice data for different business scenarios; for example, when a project starts there may be insufficient data, or the data may be too small for the model, and purchasing additional data greatly increases cost. At the same time, the response time of a large deep learning model is long; for customers to have a good experience, the model must be deployed on a GPU server, which further increases cost. These two points seriously impair building a customer service robot with natural prosody.
To this end, the present application provides a text prosody prediction method combining ASR timestamps and machine learning: obtaining a plurality of timestamps corresponding to each voice in a plurality of sample pairs, where each sample pair includes a voice and the recognition text corresponding to that voice; performing prosody annotation on each text in the sample pairs according to the timestamps to obtain a plurality of annotated texts, where prosody annotation includes placing an annotation at each speech pause; acquiring the prosody information corresponding to each annotated text, where the prosody information includes the interval distances between prosody annotations and the parts of speech of the words adjacent to each annotation; and training a base model with the texts in the sample pairs and the prosody information of the annotated texts to obtain a text prosody prediction model. The samples are annotated by means of timestamps, and the model is trained on the resulting annotated samples while taking the prosody information of the text into account; a text prosody prediction model obtained in this way can accurately predict text prosody, so that the intelligent robot can generate audio with natural prosody during subsequent speech synthesis.
In the embodiments of the present application, the executing body may be a speech synthesis apparatus in a speech synthesis system; in practical applications, the speech synthesis apparatus may be an electronic device such as a terminal device or a server, which is not limited herein.
The method of training a text prosody prediction model according to an embodiment of the present application is described in detail below with reference to fig. 1.
Referring to fig. 1, fig. 1 is a flowchart of a method for training a text prosody prediction model according to an embodiment of the present application, where the method for training a text prosody prediction model shown in fig. 1 includes:
step 110: a plurality of time stamps corresponding to each voice in a plurality of sample pairs are obtained.
Each sample pair includes a voice and the recognition text corresponding to that voice. The voices and recognition texts are interaction data between customer service agents and customers, collected manually in different scenarios in related fields such as debt collection or sales; they may also be interaction data between robots and customers. The data comprise the interactive audio and the recognition data obtained after converting the audio into text. A timestamp may be a table or time-stamped sequence recording, for each pause in the audio, the time at which the pause begins and the time at which it ends, as in the sketch below. In addition, the final text prosody prediction model can be trained with only a small number of samples by combining machine learning.
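As a concrete illustration, the following Python sketch shows one plausible layout for word-level timestamps and how pause durations can be recovered from them. The field names ("word", "start", "end") are assumptions made here for illustration; the patent only requires that the start and end time of each pause be recoverable from the ASR output.

```python
# One plausible per-utterance timestamp layout (field names assumed).
word_timestamps = [
    {"word": "你好", "start": 0.00, "end": 0.42},
    {"word": "请问", "start": 0.68, "end": 1.05},  # 260 ms pause before this word
    {"word": "有",   "start": 1.08, "end": 1.20},
    {"word": "什么", "start": 1.22, "end": 1.55},
]

def pause_durations_ms(timestamps):
    """Return the pause length in milliseconds between each pair of adjacent words."""
    return [
        round((nxt["start"] - cur["end"]) * 1000)
        for cur, nxt in zip(timestamps, timestamps[1:])
    ]

print(pause_durations_ms(word_timestamps))  # [260, 30, 20]
```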
In some embodiments of the present application, before the plurality of timestamps corresponding to each voice in the plurality of sample pairs are obtained, the method shown in fig. 1 further includes: combining a plurality of voices stored from human-computer dialogues in different scenarios and the texts corresponding to those voices into a plurality of sample pairs; training a basic speech recognition model with some of the sample pairs to obtain a speech recognition model; and inputting the remaining sample pairs into the speech recognition model to obtain a plurality of speech recognition results and a plurality of timestamps.
In this process, the speech recognition model is obtained by training on part of the sample pairs; the other sample pairs are then input into this model, so that the timestamps corresponding to the voices in those pairs can be obtained, and the sample pairs can then be accurately annotated by means of the timestamps.
The different scenarios may be, for example, a debt-collection scenario, a sales scenario, or a product-explanation scenario in the insurance field. The basic speech recognition model is trained on part of the sample pairs; for example, 1/4 of the sample pairs serve as training samples for the speech recognition model. The voices in the remaining sample pairs are then used as inputs to the speech recognition model, so that the corresponding recognition results are obtained as recognition texts; at the same time, the speech recognition model marks the time of each pause in each recognition text, yielding the plurality of timestamps (a sketch of this split-and-transcribe procedure follows). Moreover, because a machine learning model consumes far less time than a deep learning model, combining machine learning with an ASR model (speech recognition model) greatly reduces the time needed to predict prosody, gives customers a smoother service experience, and lowers the enterprise's server costs.
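A minimal sketch of the split-and-transcribe step, assuming a hypothetical ASR object with `fine_tune` and `transcribe` methods; neither name comes from the patent, which does not fix a concrete ASR implementation, and the 1/4 training fraction follows the example above.

```python
import random

def split_and_timestamp(sample_pairs, asr_model, train_fraction=0.25):
    """Fine-tune a basic ASR model on a fraction of the sample pairs, then run it
    on the remainder to collect recognition texts and word-level timestamps."""
    pairs = list(sample_pairs)
    random.shuffle(pairs)
    cut = int(len(pairs) * train_fraction)
    train_pairs, rest_pairs = pairs[:cut], pairs[cut:]

    asr_model.fine_tune(train_pairs)  # hypothetical API

    results = []
    for audio, _reference_text in rest_pairs:
        recognized_text, timestamps = asr_model.transcribe(audio)  # hypothetical API
        results.append((recognized_text, timestamps))
    return results
```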
Step 120: perform prosody annotation on each text in the plurality of sample pairs according to the plurality of timestamps to obtain a plurality of annotated texts.
The prosody annotations include annotations at speech pauses. An annotated text contains at least the pause annotations of the prosody, and may also contain annotations of loudness, emotion, long and short sounds, and the like.
In some embodiments of the present application, performing prosody annotation on each text in the plurality of sample pairs according to the plurality of timestamps to obtain the plurality of annotated texts includes: generating prosody annotation symbols corresponding to different preset pause time ranges; determining the preset pause time range corresponding to each speech pause duration recorded in the plurality of timestamps; and annotating each text in the sample pairs with the prosody annotation symbol matching the preset pause time range of each speech pause.
In this way, the different prosody symbols of the text can be annotated unambiguously according to the preset time range in which each speech pause falls, so that the prosody information for training the text prosody prediction model can be obtained accurately.
The prosody annotation symbols corresponding to different pause time ranges differ. Using the times recorded in the timestamps, the correspondence between each pause and the time gap between the words before and after it can be annotated manually or automatically. For example, a pause within 100 ms corresponds to the word level and is annotated #1; a pause in the range 100-200 ms corresponds to the enumeration-comma level and is annotated #2; a pause in the range 200-300 ms corresponds to the comma level and is annotated #3; a longer pause corresponds to the period level and is annotated #4. Finally, prosody is added to the text by a scripting tool (Python): the text is loaded in the script, and the corresponding prosody annotations are inserted into it, as in the sketch below.
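A minimal Python sketch of this annotation step, using the thresholds from the example above; whether a gap shorter than the first threshold receives a #1 mark or no mark at all is an assumption made here.

```python
def prosody_symbol(pause_ms):
    """Map a pause duration to the annotation symbols from the example above."""
    if pause_ms < 100:
        return "#1"  # word level
    if pause_ms < 200:
        return "#2"  # enumeration-comma level
    if pause_ms < 300:
        return "#3"  # comma level
    return "#4"      # period level

def annotate(word_timestamps):
    """Insert prosody symbols into the recognized text at each pause."""
    parts = []
    for cur, nxt in zip(word_timestamps, word_timestamps[1:]):
        parts.append(cur["word"])
        pause_ms = (nxt["start"] - cur["end"]) * 1000
        parts.append(prosody_symbol(pause_ms))
    parts.append(word_timestamps[-1]["word"])
    return "".join(parts)

# With the word_timestamps sketched earlier:
# annotate(word_timestamps) -> "你好#3请问#1有#1什么"
```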
Step 130: acquire the prosody information corresponding to each of the plurality of annotated texts.
The prosody information includes the interval distances between prosody annotations and the parts of speech of the words adjacent to the annotations.
In some embodiments of the present application, the interval distance includes the interval distance between two adjacent prosody annotations corresponding to the same preset pause time range, and the interval distance between a prosody annotation corresponding to a first preset pause time range and a prosody annotation corresponding to a second preset pause time range; the parts of speech of the adjacent words are the parts of speech of the words immediately before and after each of the one or more prosody annotations.
In this process, training the model with such prosody information yields a text prosody prediction model that can accurately predict the prosody information of a text.
A preset prosody annotation marks each pause whose duration falls within the corresponding preset time range. For example, the interval distance between two adjacent prosody annotations of the same preset pause time range may be the distance from a #2 symbol to the previous and next #2 symbols; the interval distance between prosody annotations of different preset pause time ranges may be the distance from a #2 symbol to the adjacent #3 symbols. The parts of speech of the adjacent words may be those of the words immediately before and after each annotation, such as adjectives, nouns, and adverbs, and may also include words that express emotion. In addition, the prosody information can be divided into subsets in a ratio such as 0.7:0.2:0.1, used respectively as a training set, a test set, and a validation set: the training set trains the model, the validation set checks the model's performance during training, and the test set verifies the final performance of the trained model. A sketch of this feature extraction follows.
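The feature extraction described above can be sketched as follows. jieba's part-of-speech tagger stands in for whatever POS tool is actually used (the patent names none), and the exact feature layout is an illustrative assumption.

```python
import re
import jieba.posseg as pseg  # POS tagging; an assumed stand-in tool

def extract_features(annotated_text):
    """For each prosody annotation, collect its interval distances to the previous
    and next annotations (in characters) and the parts of speech of the words
    immediately before and after it; the symbol itself is the training label."""
    rows = []
    matches = list(re.finditer(r"#[1-4]", annotated_text))
    for i, m in enumerate(matches):
        prev_end = matches[i - 1].end() if i > 0 else 0
        next_start = matches[i + 1].start() if i + 1 < len(matches) else len(annotated_text)
        before = annotated_text[prev_end:m.start()]
        after = annotated_text[m.end():next_start]
        pos_before = [p.flag for p in pseg.cut(before)]
        pos_after = [p.flag for p in pseg.cut(after)]
        rows.append({
            "dist_prev": len(before),                       # distance to previous annotation
            "dist_next": len(after),                        # distance to next annotation
            "pos_before": pos_before[-1] if pos_before else "",
            "pos_after": pos_after[0] if pos_after else "",
            "label": m.group(),                             # #1..#4
        })
    return rows
```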
Step 140: train the base model with the texts in the plurality of sample pairs and the prosody information corresponding to each annotated text to obtain the text prosody prediction model.
The base model may be a basic machine learning model (such as a decision tree, KNN, or support vector machine), trained on the prosody information corresponding to each text in the training set and each of the plurality of annotated texts, as sketched below.
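A minimal training sketch with scikit-learn; the patent only calls for a machine learning model such as a decision tree, KNN, or SVM, so the model choice and feature encoding here are illustrative.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

def train_prosody_model(feature_rows):
    """Fit a classifier that predicts the prosody symbol (#1..#4) from the
    interval-distance and part-of-speech features extracted above."""
    X = [{k: v for k, v in row.items() if k != "label"} for row in feature_rows]
    y = [row["label"] for row in feature_rows]
    model = make_pipeline(DictVectorizer(), DecisionTreeClassifier())
    model.fit(X, y)
    return model
```

A KNN or SVM classifier from the same library could be substituted for the decision tree without changing the rest of the pipeline.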
In the process shown in fig. 1, the present application obtains a plurality of timestamps corresponding to each voice in a plurality of sample pairs, where each sample pair includes a voice and the recognition text corresponding to that voice; performs prosody annotation on each text in the sample pairs according to the timestamps to obtain a plurality of annotated texts, where prosody annotation includes placing an annotation at each speech pause; acquires the prosody information corresponding to each annotated text, where the prosody information includes the interval distances between prosody annotations and the parts of speech of the words adjacent to each annotation; and trains a base model with the texts in the sample pairs and the prosody information of the annotated texts to obtain a text prosody prediction model. The samples are annotated by means of timestamps, and the model is trained on the resulting annotated samples while taking the prosody information of the text into account; a text prosody prediction model obtained in this way can accurately predict text prosody, so that the intelligent robot can generate audio with natural prosody during subsequent speech synthesis.
The method of predicting text prosody according to the embodiment of the present application is described in detail below with reference to fig. 2.
Referring to fig. 2, fig. 2 is a flowchart of a method for predicting text prosody according to an embodiment of the present application, where the method for predicting text prosody shown in fig. 2 includes:
step 210: and converting punctuation marks in the text to be predicted into corresponding prosody annotation marks to obtain a converted text.
Wherein, different punctuation marks correspond to different preset pause time ranges, and different preset pause time ranges correspond to different prosody marking marks. For example, prosody notation for a corresponding word level within 100ms is #1; prosody annotation symbol corresponding to the pause number level with the time range of 100ms-200ms is #2; prosody annotation symbol corresponding to comma level with time range of 200ms-300ms is #3; the prosody label corresponding to the period level for longer periods is #4. And finally, adding prosody to the text by a script tool (python), wherein the text is presented in a script mode, and the corresponding prosody annotation is inserted into the text in the script mode to obtain the converted text.
In some embodiments of the present application, before the punctuation marks in the text to be predicted are converted into the corresponding prosody annotation symbols, the method shown in fig. 2 further includes: obtaining a plurality of timestamps corresponding to each voice in a plurality of sample pairs, where each sample pair includes a voice and the recognition text corresponding to that voice; performing prosody annotation on each text in the sample pairs according to the timestamps to obtain a plurality of annotated texts, where prosody annotation includes placing an annotation at each speech pause; acquiring the prosody information corresponding to each annotated text, where the prosody information includes the interval distances between prosody annotations and the parts of speech of the words adjacent to each annotation; and training a base model with the texts in the sample pairs and the prosody information of the annotated texts to obtain the text prosody prediction model.
In this process, the samples are annotated by means of timestamps, and the text prosody prediction model can be trained on the resulting annotated samples while taking the prosody information of the text into account; a model obtained in this way can accurately predict text prosody, so that the intelligent robot can generate audio with natural prosody during subsequent speech synthesis.
In some embodiments of the present application, converting the punctuation marks in the text to be predicted into the corresponding prosody annotation symbols to obtain the converted text includes: segmenting the text to be predicted into words to obtain a segmented text; and separating the words in the segmented text with a preset symbol while converting the punctuation marks into the corresponding prosody annotation symbols to obtain the converted text.
In this process, through word segmentation and punctuation conversion, the pause durations of the text when it is converted into speech can be distinguished, so that audio with prosody is formed; a sketch of this preprocessing follows.
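A Python sketch of the text-side preprocessing: segment the text, separate the words with a preset symbol ("/" here, an assumption), and convert punctuation to the prosody symbols listed above. The mapping for 、, ，, and 。 follows the pause levels in the example; extending it to ？ and ！ is an assumption.

```python
import jieba  # Chinese word segmentation; an assumed stand-in tool

PUNCT_TO_PROSODY = {"、": "#2", "，": "#3", "。": "#4", "？": "#4", "！": "#4"}

def convert(text, sep="/"):
    """Segment the text and replace punctuation with prosody annotation symbols."""
    tokens = jieba.cut(text)
    return sep.join(PUNCT_TO_PROSODY.get(tok, tok) for tok in tokens)

print(convert("你好，请问有什么可以帮您。"))
# e.g. 你好/#3/请问/有/什么/可以/帮/您/#4
```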
Step 220: input the converted text into a preset text prosody prediction model to obtain a text prosody prediction result.
The text prosody prediction result comprises text with prosody labels.
In some embodiments of the present application, after the converted text is input into the preset text prosody prediction model to obtain the text prosody prediction result, the method shown in fig. 2 further includes: converting the prosody-annotated text in the text prosody prediction result into pinyin; converting the pinyin into phonemes through a dictionary; and synthesizing the phonemes into audio with prosody through a speech conversion model.
In this way, once the prosody prediction result of the text has been accurately obtained, the text can be converted into phonemes, and audio with prosody is finally synthesized from the phonemes; this audio naturally imitates the prosody of human speech.
Synthesizing the phonemes into audio with prosody through the speech conversion model includes synthesizing the phonemes into audio with prosody through a TTS model; a sketch of the pinyin-to-phoneme step follows.
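The front half of this back end can be sketched as follows. pypinyin is a real, commonly used pinyin library chosen here for illustration; the phoneme lexicon and the downstream TTS interface are assumptions, since the patent names no concrete dictionary or TTS implementation.

```python
import re
from pypinyin import lazy_pinyin, Style

def text_to_phonemes(annotated_text, lexicon):
    """Convert prosody-annotated Chinese text to a phoneme sequence,
    keeping the pause marks (#1..#4) in place for the TTS model."""
    phonemes = []
    for chunk in re.split(r"(#[1-4])", annotated_text):
        if re.fullmatch(r"#[1-4]", chunk):
            phonemes.append(chunk)  # preserve the pause mark
        elif chunk:
            for syllable in lazy_pinyin(chunk, style=Style.TONE3):  # e.g. "ni3", "hao3"
                phonemes.extend(lexicon.get(syllable, [syllable]))  # hypothetical lexicon lookup
    return phonemes
```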
In the process shown in fig. 2, the text prosody prediction model can accurately predict text prosody from the different preset pause time ranges corresponding to the different punctuation marks, so that the intelligent robot can generate audio with natural prosody during subsequent speech synthesis.
In addition, the specific methods, descriptions, steps, etc. in part of the steps shown in fig. 2 may refer to the methods shown in fig. 1, and are not described in detail herein.
The method of synthesizing audio with prosody according to the embodiments of the present application is described in detail below in conjunction with fig. 3.
Referring to fig. 3, fig. 3 is a flowchart of a method for synthesizing audio with prosody according to an embodiment of the present application, where the method for synthesizing audio with prosody shown in fig. 3 includes:
step 310: training a text prosody prediction model.
Specific: acquiring a plurality of time stamps corresponding to each voice in a plurality of sample pairs; performing prosody annotation on each text in a plurality of sample pairs according to a plurality of time stamps to obtain a plurality of annotated texts; acquiring prosody information corresponding to each text in a plurality of marked texts; training the basic model through prosody information corresponding to each text in the plurality of sample pairs and each text in the plurality of labeling texts to obtain a text prosody prediction model.
Step 320: and inputting the text to be synthesized into a text prosody prediction model to obtain a text prosody prediction result.
Specific: converting punctuation marks in the text to be synthesized into corresponding prosody annotation marks to obtain a converted text; and inputting the converted text into a text prosody prediction model to obtain a text prosody prediction result.
Step 330: the text in the text prosody prediction result is converted into a plurality of phonemes.
Specific: converting the text with prosody marks in the text prosody prediction result into pinyin; and converting the pinyin into phonemes through a dictionary to obtain a plurality of phonemes.
Step 340: synthesizing a plurality of phonemes is an audio method with prosody.
Specific: phonemes are synthesized into audio with prosody by a TTS model.
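Chaining the earlier sketches gives an end-to-end outline of fig. 3. The `predict_annotations` wrapper (which applies the trained classifier to the converted text) and the `tts_model.synthesize` call are hypothetical glue, not APIs defined by the patent.

```python
def synthesize_with_prosody(text, prosody_model, lexicon, tts_model):
    converted = convert(text)                                  # fig. 2, step 210
    annotated = predict_annotations(prosody_model, converted)  # fig. 2, step 220 (hypothetical wrapper)
    phonemes = text_to_phonemes(annotated, lexicon)            # fig. 3, step 330
    return tts_model.synthesize(phonemes)                      # fig. 3, step 340 (hypothetical API)
```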
In addition, the specific method and step shown in fig. 3 may refer to the methods shown in fig. 1 and fig. 2, which are not repeated herein.
The method of training a text prosody prediction model, the method of predicting text prosody, and the method of synthesizing audio with prosody have been described above with reference to fig. 1 to 3; the apparatus for training a text prosody prediction model and the apparatus for predicting text prosody are described below in conjunction with fig. 4 to 7.
Referring to fig. 4, a schematic block diagram of an apparatus 400 for training a text prosody prediction model according to an embodiment of the present application is provided, where the apparatus 400 may be a module, a program segment, or a code on an electronic device. The apparatus 400 corresponds to the embodiment of the method of fig. 1 described above, and is capable of performing the steps involved in the embodiment of the method of fig. 1. The specific functions of the apparatus 400 will be described below, and detailed descriptions thereof will be omitted herein as appropriate to avoid redundancy.
Optionally, the apparatus 400 includes:
a first acquisition module 410, configured to obtain a plurality of timestamps corresponding to each voice in a plurality of sample pairs, where each sample pair includes a voice and the recognition text corresponding to that voice;
an annotation module 420, configured to perform prosody annotation on each text in the sample pairs according to the timestamps to obtain a plurality of annotated texts, where prosody annotation includes placing an annotation at each speech pause;
a second acquisition module 430, configured to acquire the prosody information corresponding to each annotated text, where the prosody information includes the interval distances between prosody annotations and the parts of speech of the words adjacent to the annotations;
and a training module 440, configured to train a base model with the texts in the sample pairs and the prosody information of the annotated texts to obtain a text prosody prediction model.
Optionally, the apparatus further includes:
a recognition module, configured to, before the first acquisition module obtains the plurality of timestamps, combine a plurality of voices stored from human-computer dialogues in different scenarios and the texts corresponding to those voices into a plurality of sample pairs; train a basic speech recognition model with some of the sample pairs to obtain a speech recognition model; and input the remaining sample pairs into the speech recognition model to obtain a plurality of speech recognition results and a plurality of timestamps.
Optionally, the interval distance includes the interval distance between two adjacent prosody annotations corresponding to the same preset pause time range, and the interval distance between a prosody annotation corresponding to a first preset pause time range and a prosody annotation corresponding to a second preset pause time range; the parts of speech of the adjacent words are the parts of speech of the words immediately before and after each of the one or more prosody annotations.
Optionally, the annotation module is specifically configured to: generate prosody annotation symbols corresponding to different preset pause time ranges; determine the preset pause time range corresponding to each speech pause duration recorded in the plurality of timestamps; and annotate each text in the sample pairs with the prosody annotation symbol matching the preset pause time range of each speech pause.
Referring to fig. 5, a schematic block diagram of an apparatus 500 for predicting text prosody according to an embodiment of the present application is provided; the apparatus 500 may be a module, a program segment, or code on an electronic device. The apparatus 500 corresponds to the embodiment of the method of fig. 2 described above and can perform the steps involved in that embodiment; its specific functions may be found in the description below, and detailed descriptions are omitted here as appropriate to avoid redundancy.
Optionally, the apparatus 500 includes:
a conversion module 510, configured to convert the punctuation marks in a text to be predicted into the corresponding prosody annotation symbols to obtain a converted text, where different punctuation marks correspond to different preset pause time ranges, and different preset pause time ranges correspond to different prosody annotation symbols;
and a prediction module 520, configured to input the converted text into a preset text prosody prediction model to obtain a text prosody prediction result, where the text prosody prediction result includes the text with prosody annotations.
Optionally, the apparatus further includes:
a synthesis module, configured to, after the converted text is input into the preset text prosody prediction model to obtain the text prosody prediction result, convert the prosody-annotated text in the result into pinyin; convert the pinyin into phonemes through a dictionary; and synthesize the phonemes into audio with prosody through a speech conversion model.
Optionally, the conversion module is specifically configured to:
segment the text to be predicted into words to obtain a segmented text; and separate the words in the segmented text with a preset symbol while converting the punctuation marks into the corresponding prosody annotation symbols to obtain the converted text.
Optionally, the apparatus further includes:
a training module, configured to, before the conversion module converts the punctuation marks in the text to be predicted into the corresponding prosody annotation symbols, obtain a plurality of timestamps corresponding to each voice in a plurality of sample pairs, where each sample pair includes a voice and the recognition text corresponding to that voice; perform prosody annotation on each text in the sample pairs according to the timestamps to obtain a plurality of annotated texts, where prosody annotation includes placing an annotation at each speech pause; acquire the prosody information corresponding to each annotated text, where the prosody information includes the interval distances between prosody annotations and the parts of speech of the words adjacent to each annotation; and train a base model with the texts in the sample pairs and the prosody information of the annotated texts to obtain the text prosody prediction model.
Referring to fig. 6, a schematic block diagram of an apparatus for training a text prosody prediction model according to an embodiment of the present application may include a memory 610 and a processor 620. Optionally, the apparatus may further include: a communication interface 630, and a communication bus 640. The apparatus corresponds to the embodiment of the method of fig. 1 described above, and is capable of performing the steps involved in the embodiment of the method of fig. 1, and specific functions of the apparatus may be found in the following description.
In particular, memory 610 is used to store computer readable instructions.
The processor 620, for processing the memory-stored readable instructions, is capable of performing the various steps in the method of fig. 1.
Communication interface 630 is used for signaling or data communication with other node devices. For example: for communication with a server or terminal, or with other device nodes, the embodiments of the application are not limited in this regard.
Communication bus 640 for implementing direct connection communication of the above components.
The communication interface 630 of the device in the embodiments of the present application is used for signaling or data communication with other node devices. The memory 610 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory 610 may optionally also be at least one storage device located remotely from the aforementioned processor. The memory 610 stores computer readable instructions which, when executed by the processor 620, perform the method process described above in fig. 1. The processor 620 may be used in the apparatus 400 to perform the functions described herein. By way of example, the processor 620 may be a general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components; the embodiments of the present application are not limited in this regard.
Referring to fig. 7, a schematic block diagram of an apparatus for predicting text prosody according to an embodiment of the present application may include a memory 710 and a processor 720. Optionally, the apparatus may further include: a communication interface 730, and a communication bus 740. The apparatus corresponds to the above embodiment of the method of fig. 2, and can perform the steps involved in the embodiment of the method of fig. 2, and specific functions of the apparatus may be found in the following description.
In particular, the memory 710 is used to store computer readable instructions.
Processor 720, which processes the memory-stored readable instructions, is capable of performing various steps in the method of fig. 2.
Communication interface 730 for communicating signaling or data with other node devices. For example: for communication with a server or terminal, or with other device nodes, the embodiments of the application are not limited in this regard.
A communication bus 740 for implementing direct connection communication of the above-described components.
The communication interface 730 of the device in the embodiments of the present application is used for signaling or data communication with other node devices. The memory 710 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory 710 may optionally also be at least one storage device located remotely from the aforementioned processor. The memory 710 stores computer readable instructions which, when executed by the processor 720, perform the method process described above in fig. 2. The processor 720 may be used in the apparatus 500 to perform the functions described herein. By way of example, the processor 720 may be a general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components; the embodiments of the present application are not limited in this regard.
Embodiments of the present application also provide a readable storage medium storing a computer program which, when executed by a processor, performs the method process performed by the electronic device in the method embodiments shown in fig. 1 or fig. 2.
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding procedure in the foregoing method for the specific working procedure of the apparatus described above, and this will not be repeated here.
In summary, the embodiments of the present application provide a method and apparatus for training a text prosody prediction model and predicting text prosody. The method includes: obtaining a plurality of timestamps corresponding to each voice in a plurality of sample pairs, where each sample pair includes a voice and the recognition text corresponding to that voice; performing prosody annotation on each text in the sample pairs according to the timestamps to obtain a plurality of annotated texts, where prosody annotation includes placing an annotation at each speech pause; acquiring the prosody information corresponding to each annotated text, where the prosody information includes the interval distances between prosody annotations and the parts of speech of the words adjacent to each annotation; and training a base model with the texts in the sample pairs and the prosody information of the annotated texts to obtain a text prosody prediction model. The text prosody prediction model obtained in this way can accurately predict text prosody, so that the intelligent robot can generate audio with natural prosody during subsequent speech synthesis.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may also be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods, and computer program products according to various embodiments of the present application. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application; various modifications and variations may be suggested to those skilled in the art. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall be included in the protection scope of the present application. It should be noted that like reference numerals and letters denote like items in the figures; once an item is defined in one figure, no further definition or explanation of it is necessary in subsequent figures.
The foregoing is merely specific embodiments of the present application, but the protection scope of the present application is not limited thereto; any person skilled in the art could readily conceive of changes or substitutions within the technical scope disclosed in the present application, and such changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

Claims (10)

1. A method of training a text prosody prediction model, comprising:
acquiring a plurality of time stamps corresponding to each voice in a plurality of sample pairs, wherein each sample pair comprises a voice and a recognition text corresponding to the voice;
performing prosodic annotation on each text in the plurality of sample pairs according to the plurality of time stamps to obtain a plurality of annotated texts, wherein the prosodic annotation comprises an annotation at a voice pause;
acquiring prosodic information corresponding to each text in the plurality of annotated texts, wherein the prosodic information comprises a separation distance of the prosodic annotations and parts of speech of words adjacent to the prosodic annotations;
training a base model through the prosodic information corresponding to each text in the plurality of sample pairs and each text in the plurality of annotated texts, to obtain a text prosody prediction model.
2. The method of claim 1, wherein prior to the acquiring of the plurality of time stamps corresponding to each voice in the plurality of sample pairs, the method further comprises:
combining a plurality of voices stored from human-computer dialogues in different scenarios, and the texts corresponding to the voices, into a plurality of sample pairs;
training a base speech recognition model through part of the sample pairs in the plurality of sample pairs to obtain a speech recognition model;
inputting the sample pairs other than the part of the sample pairs into the speech recognition model to obtain a plurality of speech recognition results and a plurality of time stamps.
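By way of a non-limiting illustration of claim 2, the following Python sketch shows the data flow only: fine_tune_asr and transcribe_with_timestamps are hypothetical stand-ins for an unspecified speech recognition toolkit, and the file names and texts are invented examples.

# A hedged sketch of claim 2's data flow; the ASR functions are stubs.
from typing import List, Tuple

SamplePair = Tuple[str, str]  # (path to a voice recording, recognition text)

def fine_tune_asr(pairs: List[SamplePair]) -> object:
    """Hypothetical: train a base speech recognition model on sample pairs."""
    return object()  # placeholder for the trained model

def transcribe_with_timestamps(model: object, voice_path: str):
    """Hypothetical: return (recognition result, word-level time stamps)."""
    return "", []  # placeholder output

pairs: List[SamplePair] = [("call_001.wav", "您好请问有什么可以帮您"),
                           ("call_002.wav", "我想咨询一下保单")]
split = len(pairs) // 2
asr_model = fine_tune_asr(pairs[:split])        # train on part of the pairs
for voice_path, _ in pairs[split:]:             # time stamps from the rest
    result, stamps = transcribe_with_timestamps(asr_model, voice_path)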
3. The method according to claim 1 or 2, wherein the separation distance comprises the separation distance between two adjacent prosodic annotations among the one or more prosodic annotations corresponding to a preset pause time range, and the separation distance between a prosodic annotation corresponding to the preset pause time range and a prosodic annotation corresponding to an adjacent second preset pause time range;
and the parts of speech of the adjacent words comprise the parts of speech of the words immediately before and after each prosodic annotation of the one or more prosodic annotations.
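By way of a non-limiting illustration of claim 3, the following Python sketch computes separation distances and adjacent parts of speech over a toy annotated word sequence; for brevity it collapses the two distance types of the claim into a single distance between adjacent annotations, and the POS tags and "#1"/"#2" levels are assumed conventions.

# Toy annotated sequence: words interleaved with prosodic annotation symbols.
words = ["今天", "天气", "#1", "很", "好", "#2"]
pos = {"今天": "t", "天气": "n", "很": "d", "好": "a"}  # assumed POS tags

annotation_positions = [i for i, w in enumerate(words) if w.startswith("#")]

features = []
for k, idx in enumerate(annotation_positions):
    prev = annotation_positions[k - 1] if k > 0 else -1
    features.append({
        "level": words[idx],
        "separation": idx - prev - 1,  # words since the previous annotation
        "pos_before": pos.get(words[idx - 1]) if idx > 0 else None,
        "pos_after": pos.get(words[idx + 1]) if idx + 1 < len(words) else None,
    })
print(features)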
4. The method according to claim 1 or 2, wherein the performing prosodic annotation on each text in the plurality of sample pairs according to the plurality of time stamps to obtain a plurality of annotated texts comprises:
generating prosodic annotation symbols corresponding to different preset pause time ranges;
determining the preset pause time range corresponding to each voice pause time indicated in the plurality of time stamps;
and annotating each text in the sample pairs with the prosodic annotation symbol corresponding to the preset pause time range of each voice pause time.
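By way of a non-limiting illustration of claim 4, the following Python sketch maps voice pause durations, taken from word-level time stamps, onto annotation symbols; the pause ranges (in seconds) and the "#1"/"#2"/"#3" symbols are assumed for this example.

# Assumed preset pause time ranges and their prosodic annotation symbols.
RANGES = [(0.05, 0.20, "#1"), (0.20, 0.50, "#2"), (0.50, float("inf"), "#3")]

def symbol_for_pause(duration: float):
    for low, high, symbol in RANGES:
        if low <= duration < high:
            return symbol
    return None  # pause too short to annotate

# Word-level time stamps: (word, start in seconds, end in seconds).
stamps = [("今天", 0.0, 0.4), ("天气", 0.5, 0.9), ("很", 1.2, 1.3), ("好", 1.32, 1.6)]

annotated = []
for i, (word, start, end) in enumerate(stamps):
    annotated.append(word)
    if i + 1 < len(stamps):
        symbol = symbol_for_pause(stamps[i + 1][1] - end)
        if symbol:
            annotated.append(symbol)
print("".join(annotated))  # 今天#1天气#2很好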
5. A method of predicting text prosody, comprising:
converting punctuation marks in a text to be predicted into corresponding prosodic annotations to obtain a converted text, wherein different punctuation marks correspond to different preset pause time ranges, and different preset pause time ranges correspond to different prosodic annotations;
and inputting the converted text into a preset text prosody prediction model to obtain a text prosody prediction result, wherein the text prosody prediction result comprises the text with prosodic annotations.
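By way of a non-limiting illustration of claim 5, the following Python sketch converts punctuation marks into annotation symbols before prediction; the punctuation-to-annotation table is an assumed example of the claimed correspondence, and predict_prosody is a hypothetical stand-in for the preset text prosody prediction model.

# Assumed mapping: punctuation -> preset pause range -> annotation symbol.
PUNCT_TO_ANNOTATION = {"、": "#1", "，": "#2", "。": "#3", "？": "#3", "！": "#3"}

def convert(text: str) -> str:
    return "".join(PUNCT_TO_ANNOTATION.get(ch, ch) for ch in text)

def predict_prosody(converted: str) -> str:
    """Hypothetical stand-in for the preset text prosody prediction model."""
    return converted  # a real model would also insert annotations inside clauses

converted = convert("今天天气很好，我们出去走走。")
print(predict_prosody(converted))  # 今天天气很好#2我们出去走走#3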
6. The method according to claim 5, wherein after the inputting of the converted text into the preset text prosody prediction model to obtain the text prosody prediction result, the method further comprises:
converting the text with prosodic annotations in the text prosody prediction result into pinyin;
converting the pinyin into phonemes through a dictionary;
and synthesizing the phonemes into audio with prosody through a speech conversion model.
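By way of a non-limiting illustration of claim 6, the following Python sketch walks the text-to-pinyin-to-phoneme chain; pypinyin is an existing library for the pinyin step, while the phoneme dictionary entries and synthesize_audio are hypothetical placeholders for the claimed dictionary and speech conversion model.

from pypinyin import lazy_pinyin, Style

PHONEME_DICT = {"jin1": ["j", "in1"], "tian1": ["t", "ian1"]}  # toy entries

def synthesize_audio(phonemes):
    """Hypothetical speech conversion model: phonemes -> waveform bytes."""
    return b""  # placeholder waveform

pinyin = lazy_pinyin("今天", style=Style.TONE3)  # ['jin1', 'tian1']
phonemes = [p for syllable in pinyin for p in PHONEME_DICT.get(syllable, [syllable])]
audio = synthesize_audio(phonemes + ["#1"])  # prosodic annotation passed through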
7. The method according to claim 5 or 6, wherein the converting punctuation marks in the text to be predicted into corresponding prosodic annotations to obtain the converted text comprises:
performing word segmentation on the text to be predicted to obtain a segmented text;
and separating the words in the segmented text with a preset symbol, and converting the punctuation marks into the corresponding prosodic annotations to obtain the converted text.
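By way of a non-limiting illustration of claim 7, the following Python sketch uses the jieba segmenter, which the claims do not prescribe, with "/" as an assumed example of the preset separator symbol.

import jieba  # an existing Chinese word segmentation library

PUNCT_TO_ANNOTATION = {"，": "#2", "。": "#3"}  # assumed mapping, as in claim 5

def convert_for_prediction(text: str, sep: str = "/") -> str:
    words = jieba.lcut(text)  # word segmentation; punctuation stays as tokens
    words = [PUNCT_TO_ANNOTATION.get(w, w) for w in words]
    return sep.join(words)

print(convert_for_prediction("今天天气很好，我们出去走走。"))
# e.g. 今天/天气/很/好/#2/我们/出去/走走/#3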
8. The method of claim 5 or 6, wherein prior to the converting of the punctuation marks in the text to be predicted into corresponding prosodic annotations, the method further comprises:
acquiring a plurality of time stamps corresponding to each voice in a plurality of sample pairs, wherein each sample pair comprises a voice and a recognition text corresponding to the voice;
performing prosodic annotation on each text in the plurality of sample pairs according to the plurality of time stamps to obtain a plurality of annotated texts, wherein the prosodic annotation comprises an annotation at a voice pause;
acquiring prosodic information corresponding to each text in the plurality of annotated texts, wherein the prosodic information comprises a separation distance of the prosodic annotations and parts of speech of words adjacent to the prosodic annotations;
and training a base model through the prosodic information corresponding to each text in the plurality of sample pairs and each text in the plurality of annotated texts, to obtain the text prosody prediction model.
9. An apparatus for training a text prosody prediction model, comprising:
a first acquisition module, configured to acquire a plurality of time stamps corresponding to each voice in a plurality of sample pairs, wherein each sample pair comprises a voice and a recognition text corresponding to the voice;
an annotation module, configured to perform prosodic annotation on each text in the plurality of sample pairs according to the plurality of time stamps to obtain a plurality of annotated texts, wherein the prosodic annotation comprises an annotation at a voice pause;
a second acquisition module, configured to acquire prosodic information corresponding to each text in the plurality of annotated texts, wherein the prosodic information comprises a separation distance of the prosodic annotations and parts of speech of words adjacent to the prosodic annotations;
and a training module, configured to train a base model through the prosodic information corresponding to each text in the plurality of sample pairs and each text in the plurality of annotated texts, to obtain a text prosody prediction model.
10. An apparatus for predicting text prosody, comprising:
a conversion module, configured to convert punctuation marks in a text to be predicted into corresponding prosodic annotations to obtain a converted text, wherein different punctuation marks correspond to different preset pause time ranges, and different preset pause time ranges correspond to different prosodic annotations;
and a prediction module, configured to input the converted text into a preset text prosody prediction model to obtain a text prosody prediction result, wherein the text prosody prediction result comprises the text with prosodic annotations.
CN202310196588.1A 2023-02-24 2023-02-24 Training text prosody prediction model, and method and device for predicting text prosody Pending CN116168684A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310196588.1A CN116168684A (en) 2023-02-24 2023-02-24 Training text prosody prediction model, and method and device for predicting text prosody

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310196588.1A CN116168684A (en) 2023-02-24 2023-02-24 Training text prosody prediction model, and method and device for predicting text prosody

Publications (1)

Publication Number Publication Date
CN116168684A true CN116168684A (en) 2023-05-26

Family

ID=86416317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310196588.1A Pending CN116168684A (en) 2023-02-24 2023-02-24 Training text prosody prediction model, and method and device for predicting text prosody

Country Status (1)

Country Link
CN (1) CN116168684A (en)

Similar Documents

Publication Publication Date Title
CN111145719B (en) Data labeling method and device for Chinese-English mixing and tone labeling
CN110459202B (en) Rhythm labeling method, device, equipment and medium
CN109754783A (en) Method and apparatus for determining the boundary of audio sentence
US10685644B2 (en) Method and system for text-to-speech synthesis
CN106486121A (en) It is applied to the voice-optimizing method and device of intelligent robot
CN107481715B (en) Method and apparatus for generating information
CN113658577B (en) Speech synthesis model training method, audio generation method, equipment and medium
CN112818089B (en) Text phonetic notation method, electronic equipment and storage medium
CN113053357B (en) Speech synthesis method, apparatus, device and computer readable storage medium
CN112365878A (en) Speech synthesis method, device, equipment and computer readable storage medium
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN116682411A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
CN112069816A (en) Chinese punctuation adding method, system and equipment
CN113823256A (en) Self-generated text-to-speech (TTS) synthesis
CN113314096A (en) Speech synthesis method, apparatus, device and storage medium
CN116665639A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN114708848A (en) Method and device for acquiring size of audio and video file
CN115050351A (en) Method and device for generating timestamp and computer equipment
CN116168684A (en) Training text prosody prediction model, and method and device for predicting text prosody
CN114842826A (en) Training method of speech synthesis model, speech synthesis method and related equipment
CN110310620B (en) Speech fusion method based on native pronunciation reinforcement learning
CN109559753B (en) Speech recognition method and device
CN113223513A (en) Voice conversion method, device, equipment and storage medium
CN112233648A (en) Data processing method, device, equipment and storage medium combining RPA and AI

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination