CN110797005B

CN110797005B - Prosody prediction method, apparatus, device, and medium

Info

Publication number: CN110797005B
Application number: CN201911072965.0A
Authority: CN
Inventors: 高占杰; 聂志朋; 卞衍尧; 陈昌滨
Original assignee: Baidu Online Network Technology Beijing Co Ltd
Current assignee: Baidu Online Network Technology Beijing Co Ltd
Priority date: 2019-11-05
Filing date: 2019-11-05
Publication date: 2022-06-10
Anticipated expiration: 2039-11-05
Also published as: CN110797005A

Abstract

The embodiments of the present application disclose a prosody prediction method, apparatus, device, and medium, relating to the field of data processing and in particular to speech synthesis technology. The method comprises: segmenting a Chinese-English mixed text to be predicted to obtain Chinese text and English text; determining character vectors of the characters in the Chinese text and word vectors of the words in the English text; and determining a prosody prediction result for the Chinese-English mixed text according to the determined character vectors and word vectors. The embodiments improve the accuracy of prosody prediction for Chinese-English mixed text.

Description

Prosody prediction method, apparatus, device, and medium

Technical Field

The embodiments of the present application relate to the field of data processing, and in particular to speech synthesis technology. Specifically, the embodiments provide a prosody prediction method, apparatus, device, and medium.

Background

Before speech synthesis, prosody prediction needs to be performed on the text to be synthesized.

A conventional prosody prediction method works as follows: using machine learning, the text content is processed by a pre-trained prediction model to obtain a pause prediction result for that content. The pause prediction result may comprise a pause position, a pause type (for example, long pause or short pause), and a probability value corresponding to the pause type.

The above scheme has the following defect:

The language of the text content to be predicted is not distinguished. When the text content contains both Chinese and English, that is, when the text to be predicted is a Chinese-English mixed text, English words are likely to be treated directly as strings of letters. Treating English words as strings of letters loses the semantic information of the words, which reduces the accuracy of prosody prediction.

Disclosure of Invention

The embodiments of the present application provide a prosody prediction method, apparatus, device, and medium, so as to improve the accuracy of prosody prediction for Chinese-English mixed text.

An embodiment of the present application provides a prosody prediction method, which comprises:

segmenting a Chinese-English mixed text to be predicted to obtain Chinese text and English text;

determining character vectors of the characters in the Chinese text and word vectors of the words in the English text;

and determining a prosody prediction result for the Chinese-English mixed text according to the determined character vectors and word vectors.

In the embodiments of the present application, the Chinese-English mixed text to be predicted is segmented to obtain Chinese text and English text, and the prosody prediction result for the mixed text is determined according to the character vectors of the Chinese characters and the word vectors of the English words, thereby achieving prosody prediction for Chinese-English mixed text.

Because the prosody prediction result is determined from the word vectors of the words in the English text, and a word vector retains the semantic information of its word, the accuracy of prosody prediction for Chinese-English mixed text can be improved.

Further, determining the prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors comprises:

ordering the character vectors of the characters and the word vectors of the words according to the positions of the characters and words in the Chinese-English mixed text, to generate a text vector sequence;

and determining the prosody prediction result of the Chinese-English mixed text according to the text vector sequence.

Based on these technical features, the embodiments of the present application can achieve the following effect: because the prosody prediction result is determined from the text vector sequence, the prediction takes into account the positions of the characters and words in the Chinese-English mixed text, which further improves the accuracy of the prosody prediction result.
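As an illustration only, the ordering step can be sketched in Python as follows. The function name `build_text_vector_sequence` and the `(position, vector)` pair layout are assumptions made for this sketch; the application does not prescribe a data layout.

```python
def build_text_vector_sequence(char_items, word_items):
    """Merge character vectors and word vectors into one position-ordered sequence.

    Each item is a hypothetical (position, vector) pair, where position is the
    index of the character or word in the original mixed text.
    """
    merged = sorted(char_items + word_items, key=lambda item: item[0])
    # Drop the positions once the order has been fixed.
    return [vector for _, vector in merged]


# Three Chinese characters at positions 0, 1, 3 and one English word at position 2.
chars = [(0, [0.1]), (1, [0.2]), (3, [0.4])]
words = [(2, [0.3])]
sequence = build_text_vector_sequence(chars, words)
```

The resulting list preserves the reading order of the mixed text, so a downstream model sees each English word vector between its neighbouring Chinese character vectors.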

Further, determining the word vector of a word in the English text comprises:

segmenting the word in the English text into a letter sequence;

determining a letter vector for each letter in the letter sequence;

and extracting a word vector representing the semantics of the word according to the determined letter vectors.

Based on these technical features, the technical scheme of the embodiments of the present application can achieve the following effect: by segmenting a word in the English text into a letter sequence, determining a letter vector for each letter, and extracting a word vector representing the semantics of the word from those letter vectors, a word vector can be determined even for a new word that is not in the dictionary, so that prosody prediction can be performed for such new words.

Further, extracting the word vector representing the semantics of the word according to the determined letter vectors comprises:

ordering the letter vectors according to the positions of the letters in the letter sequence to generate a letter vector sequence;

encoding the letter vector sequence based on attention distribution probabilities of the letter vectors to generate a semantic representation;

and decoding the semantic representation to obtain the word vector.

Based on these technical features, the technical scheme of the embodiments of the present application encodes the letter vector sequence into a semantic representation using the attention distribution probabilities of the letter vectors. This retains the information carried by the individual letters, avoids the loss of detail, and thereby improves the accuracy of the word vector.

Further, the prosody prediction result of the Chinese-English mixed text is determined by a Chinese-English mixed prosody recognition model;

the Chinese-English mixed prosody recognition model comprises a Chinese-English mixed language model and a prosody network layer; the Chinese-English mixed language model is obtained by unsupervised training.

Based on these technical features, the embodiments of the present application can achieve the following effect: because the Chinese-English mixed language model is obtained by unsupervised learning, the amount of annotated data needed to train the Chinese-English mixed prosody recognition model can be reduced, and precision and recall are effectively improved for prosody annotation data of the same scale.

An embodiment of the present application further provides a prosody prediction device, including:

the text splitting module is used for splitting the Chinese and English mixed text to be predicted to obtain a Chinese text and an English text;

the word vector determination module is used for determining character vectors of the characters in the Chinese text and word vectors of the words in the English text;

and the result determination module is used for determining the prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors.

Further, the result determination module includes:

the vector sorting unit is used for ordering the character vectors of the characters and the word vectors of the words according to the positions of the characters and words in the Chinese-English mixed text, to generate a text vector sequence;

and the result determining unit is used for determining the prosody prediction result of the Chinese-English mixed text according to the text vector sequence.

Further, the word vector determination module includes:

the letter segmentation unit is used for segmenting words in the English text into letter sequences;

the vector determination unit is used for determining a letter vector for each letter in the letter sequence;

and the vector extraction unit is used for extracting a word vector representing word semantics according to the determined letter vector.

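Further, the vector extraction unit is specifically configured to:

order the letter vectors according to the positions of the letters in the letter sequence to generate a letter vector sequence;

encode the letter vector sequence based on attention distribution probabilities of the letter vectors to generate a semantic representation;

and decode the semantic representation to obtain the word vector.

Further, the prosody prediction result of the Chinese-English mixed text is determined by a Chinese-English mixed prosody recognition model;

the Chinese-English mixed prosody recognition model comprises a Chinese-English mixed language model and a prosody network layer; the Chinese-English mixed language model is obtained by unsupervised training.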

An embodiment of the present application further provides an electronic device, which includes:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments of the present application.

Embodiments of the present application also provide a non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any of the embodiments of the present application.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:

FIG. 1 is a flow chart of a prosody prediction method according to a first embodiment of the present application;

FIG. 2 is a flowchart of a prosody prediction method according to a second embodiment of the present application;

FIG. 3 is a flowchart of a prosody prediction method according to a third embodiment of the present application;

FIG. 4 is a schematic diagram of a model structure of a prosody prediction method according to a fourth embodiment of the present application;

FIG. 5 is a schematic structural diagram of a prosody prediction device according to a fifth embodiment of the present application;

FIG. 6 is a block diagram of an electronic device of a prosody prediction method according to an embodiment of the present application.

Detailed Description

The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

First embodiment

Fig. 1 is a flowchart of a prosody prediction method according to a first embodiment of the present application. The embodiment can be applied to the case of accurately predicting the prosody of Chinese and English mixed texts. The method may be performed by a prosody prediction device, which may be implemented in software and/or hardware. Referring to fig. 1, the prosody prediction method provided in this embodiment includes:

S110, segmenting the Chinese and English mixed text to be predicted to obtain a Chinese text and an English text.

The Chinese-English mixed text includes both Chinese and English.

Chinese text is text that includes only Chinese.

English text is text that includes only English.

The segmentation may yield one, two, or more pieces of Chinese text and one, two, or more pieces of English text.
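A minimal Python sketch of this segmentation step is given below for illustration. It assumes that "Chinese" means characters in the CJK Unified Ideographs range U+4E00 to U+9FFF and that "English" means runs of ASCII letters joined by spaces, apostrophes, or hyphens; the function name `split_mixed_text` and these character ranges are assumptions of the sketch, and punctuation and digits are simply skipped.

```python
import re

# Assumed definitions: a Chinese segment is a run of CJK ideographs; an English
# segment is a run of ASCII letters that may contain inner spaces, ' or -.
_SEGMENT_RE = re.compile(
    r"(?P<zh>[\u4e00-\u9fff]+)"
    r"|(?P<en>[A-Za-z][A-Za-z' -]*[A-Za-z]|[A-Za-z])"
)


def split_mixed_text(text):
    """Return (kind, segment, start_offset) tuples in reading order.

    kind is "zh" for Chinese text and "en" for English text. The start offset
    is kept so that the original order can be restored later.
    """
    return [(m.lastgroup, m.group(), m.start()) for m in _SEGMENT_RE.finditer(text)]


segments = split_mixed_text("我喜欢machine learning课程")
```

For this input the sketch yields two pieces of Chinese text and one piece of English text, which matches the remark above that each kind may occur one or more times.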

S120, determining character vectors of the characters in the Chinese text and word vectors of the words in the English text.

Specifically, the method for determining the character vectors and the word vectors is not limited here, and any vector conversion method in the prior art may be used.

S130, determining a prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors.

Specifically, the prosodic prediction result may include a pause location, a pause type (which may include long pauses, short pauses, etc.), and a probability value corresponding to the pause type.
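For illustration, the three components of a prosody prediction result could be represented as in the following Python sketch. The class names, the `NONE` pause type, and the `confident_pauses` helper with its 0.5 threshold are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass
from enum import Enum


class PauseType(Enum):
    NONE = "none"    # assumed label for "no pause here"
    SHORT = "short"
    LONG = "long"


@dataclass(frozen=True)
class PausePrediction:
    position: int          # index of the character or word the pause follows
    pause_type: PauseType  # predicted pause type at that position
    probability: float     # probability value corresponding to pause_type


def confident_pauses(predictions, threshold=0.5):
    """Keep real pauses whose probability reaches the (assumed) threshold."""
    return [
        p for p in predictions
        if p.pause_type is not PauseType.NONE and p.probability >= threshold
    ]


example = [
    PausePrediction(2, PauseType.SHORT, 0.81),
    PausePrediction(3, PauseType.NONE, 0.93),
    PausePrediction(5, PauseType.LONG, 0.42),
]
kept = confident_pauses(example)
```

Keeping the probability next to the pause type lets a speech synthesizer decide for itself how aggressively to insert pauses.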

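The step of determining the prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors may comprise:

ordering the character vectors of the characters and the word vectors of the words according to the positions of the characters and words in the Chinese-English mixed text, to generate a text vector sequence;

and determining the prosody prediction result of the Chinese-English mixed text according to the text vector sequence.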

According to the technical scheme of this embodiment, the Chinese-English mixed text to be predicted is segmented to obtain Chinese text and English text, and the prosody prediction result of the mixed text is determined according to the character vectors of the Chinese characters and the word vectors of the words in the English text, thereby achieving prosody prediction for Chinese-English mixed text.

Second embodiment

Fig. 2 is a flowchart of a prosody prediction method according to a second embodiment of the present application. The present embodiment is an alternative proposed on the basis of the above-described embodiments. Referring to fig. 2, the prosody prediction method provided in this embodiment includes:

S210, segmenting the Chinese and English mixed text to be predicted to obtain a Chinese text and an English text.

S220, determining character vectors of the characters in the Chinese text and word vectors of the words in the English text.

Determining the word vector of a word in the English text comprises:

segmenting the word in the English text into a letter sequence;

determining a letter vector for each letter in the letter sequence;

and extracting a word vector representing the semantics of the word according to the determined letter vectors.

Specifically, extracting a word vector representing word semantics according to the determined letter vector comprises:

encoding the letter vector sequence to generate semantic representation;

and decoding the semantic representation to obtain the word vector.

In order to improve the accuracy of the word vector, encoding the letter vector sequence to generate the semantic representation comprises:

encoding the letter vector sequence based on attention distribution probabilities of the letter vectors to generate the semantic representation.

The attention distribution probabilities of the letter vectors may be determined by an attention mechanism.
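The following Python sketch illustrates one possible way to turn letter vectors into a word vector with attention weights. It is a simplification under stated assumptions: the letter table is randomly initialised rather than learned, the attention score is a plain dot product with a hypothetical `query` vector, and the decoding step is omitted, so the attention-weighted sum stands in for the final word vector. Note that this pooling ignores letter order, whereas the scheme above encodes an ordered letter vector sequence.

```python
import math
import random
import string


def make_letter_table(dim, seed=0):
    """Build a stand-in letter dictionary: one random vector per lowercase letter."""
    rng = random.Random(seed)
    return {ch: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for ch in string.ascii_lowercase}


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_vector_from_letters(word, letter_table, query):
    """Return (word_vector, attention_weights) for an English word."""
    letters = [ch for ch in word.lower() if ch in letter_table]
    if not letters:
        raise ValueError("word contains no known letters")
    vectors = [letter_table[ch] for ch in letters]
    # One attention score per letter: dot product with the query vector.
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    weights = softmax(scores)
    # Attention-weighted sum of the letter vectors.
    pooled = [sum(w * vec[i] for w, vec in zip(weights, vectors))
              for i in range(len(query))]
    return pooled, weights


table = make_letter_table(dim=4)
vector, weights = word_vector_from_letters("WiFi", table, [0.5, -0.25, 0.1, 0.0])
```

Because the vector is built from letters, a word that is absent from any word dictionary still receives a representation, which is the property this embodiment relies on.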

S230, determining a prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors.

According to the technical scheme of this embodiment, a word in the English text is segmented into a letter sequence, a letter vector is determined for each letter in the letter sequence, and a word vector representing the semantics of the word is extracted from the determined letter vectors. A word vector can therefore be determined even for a new word that is not in the dictionary, so that prosody prediction can be performed for such new words.

Third embodiment

Fig. 3 is a flowchart of a prosody prediction method according to a third embodiment of the present application. The present embodiment is an alternative proposed on the basis of the above-described embodiments. Referring to fig. 3, the prosody prediction method provided in this embodiment includes:

S310, segmenting the Chinese-English mixed text to be predicted to obtain Chinese text and English text.

S320, determining character vectors of the characters in the Chinese text and word vectors of the words in the English text.

S330, inputting the determined character vectors and word vectors into a Chinese-English mixed prosody recognition model, and outputting a prosody prediction result of the Chinese-English mixed text.

The Chinese-English mixed prosody recognition model is a model for performing prosody prediction on Chinese-English mixed text. The model is trained in advance with labeled samples by supervised learning. It comprises a Chinese-English mixed language model and a prosody network layer, where the Chinese-English mixed language model is obtained by unsupervised training.

The Chinese-English mixed language model is used to extract the character vectors of the Chinese text and the word vectors of the English text, and to determine the semantic relations between the characters or words in the Chinese-English mixed text according to the extracted character vectors and word vectors.

The prosody network layer is used to convert the determined semantic relations between the characters or words in the Chinese-English mixed text into the prosody prediction result of the Chinese-English mixed text.
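As a rough illustration of the role of the prosody network layer, the Python sketch below maps each per-token hidden vector, such as a language model might output, to a pause type and its probability. It assumes a single linear layer followed by softmax and a three-way label set; the names `PAUSE_TYPES` and `prosody_layer`, the label set, and the hand-picked weights are all hypothetical.

```python
import math

PAUSE_TYPES = ("none", "short", "long")  # assumed label set


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prosody_layer(hidden_states, weights, bias):
    """Classify each token position into a pause type.

    hidden_states: one hidden vector per character or word.
    weights: one weight row per pause type; bias: one value per pause type.
    Returns (position, pause_type, probability) tuples.
    """
    results = []
    for position, hidden in enumerate(hidden_states):
        logits = [sum(w * h for w, h in zip(row, hidden)) + b
                  for row, b in zip(weights, bias)]
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((position, PAUSE_TYPES[best], probs[best]))
    return results


# Two tokens with 2-dimensional hidden vectors and hand-picked parameters.
predictions = prosody_layer(
    hidden_states=[[1.0, 0.0], [0.0, 1.0]],
    weights=[[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]],
    bias=[0.0, 0.0, 0.0],
)
```

In a trained system the weights would be learned from prosody annotation data rather than set by hand.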

According to the technical scheme of this embodiment, a Chinese-English mixed language model obtained by unsupervised learning is introduced. Because the language model is trained without annotation, the amount of annotated data needed to train the Chinese-English mixed prosody recognition model can be reduced, and precision and recall are effectively improved for prosody annotation data of the same scale.

Fourth embodiment

Fig. 4 is a schematic diagram of a model structure of a prosody prediction method according to a fourth embodiment of the present application. The present embodiment is an alternative proposed on the basis of the above-described embodiments. Referring to fig. 4, the prosody prediction method provided in this embodiment includes:

Segmenting the Chinese-English mixed text to be predicted to obtain Chinese text and English text.

Each word in the English text is segmented into a letter sequence, and the Chinese text is segmented into a character sequence.

The mixed sequence of characters and letters is input into a Chinese-English mixed prosody model, which outputs a prosody result.

The Chinese-English mixed prosody model comprises a Chinese-English mixed language model and multiple fully connected layers serving as the prosody network layer.

First, the Chinese-English mixed language model is trained without supervision on a large amount of pure Chinese, pure English, and Chinese-English mixed data. After the Chinese-English mixed language model is obtained, multiple fully connected layers are appended to it. The whole model is then trained by supervised learning on Chinese-English mixed prosody annotation data to obtain the complete Chinese-English mixed prosody model.

When the Chinese-English mixed language model is trained, the input Chinese-English mixed text is first segmented into characters and words, and each word is further split into a letter sequence. Each Chinese character is then mapped directly to its character vector through the dictionary. The letter sequence of each word is mapped to a letter vector sequence through the letter dictionary, and the word vector of the word is learned from the letter vector sequence through a fully connected layer and an attention mechanism.
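The input preparation described above can be sketched in Python as follows. The function names, the lowercase normalisation, and the use of index 0 for unknown entries are assumptions of this sketch; the two dictionaries are toy examples standing in for the character dictionary and letter dictionary.

```python
import re
import string

# One token per Chinese character, one token per run of ASCII letters.
_TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+")


def tokenize_mixed(text):
    """Split mixed text into ("char", character) and ("word", [letters]) tokens."""
    tokens = []
    for match in _TOKEN_RE.finditer(text):
        piece = match.group()
        if piece.isascii():
            tokens.append(("word", list(piece.lower())))
        else:
            tokens.append(("char", piece))
    return tokens


def index_tokens(tokens, char_vocab, letter_vocab, unk_id=0):
    """Look up each character, and each letter of each word, in its dictionary."""
    indexed = []
    for kind, value in tokens:
        if kind == "char":
            indexed.append(char_vocab.get(value, unk_id))
        else:
            indexed.append([letter_vocab.get(ch, unk_id) for ch in value])
    return indexed


char_vocab = {"打": 1, "开": 2}  # toy character dictionary
letter_vocab = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase)}
tokens = tokenize_mixed("打开WiFi")
ids = index_tokens(tokens, char_vocab, letter_vocab)
```

Each character index would select a character vector directly, while each list of letter indices would select the letter vector sequence from which the word vector is derived.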

The entire Chinese-English sequence is thereby converted into a sequence of character vectors and word vectors. This vector sequence is passed through a multi-layer conversion network to obtain the Chinese-English mixed language model.

According to the technical scheme of this embodiment, prosody prediction for Chinese-English mixed text can be achieved. Introducing the Chinese-English mixed language model effectively improves precision and recall for prosody annotation data of the same scale. That is, starting from the pre-trained Chinese-English mixed language model, a Chinese-English mixed prosody model with good precision and recall can be trained from a small amount of annotated data, which addresses the scarcity of prosody annotation data.

In addition, the Chinese-English mixed language model models English words at the letter level, which effectively improves prediction accuracy for words that are not in the dictionary during Chinese-English mixed prosody prediction.

Fifth embodiment

Fig. 5 is a schematic structural diagram of a prosody prediction device according to a fifth embodiment of the present application. Referring to fig. 5, the prosody prediction device 500 provided in the present embodiment includes: a text splitting module 501, a word vector determination module 502, and a result determination module 503.

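The text splitting module 501 is configured to segment a Chinese-English mixed text to be predicted to obtain Chinese text and English text;

the word vector determination module 502 is configured to determine character vectors of the characters in the Chinese text and word vectors of the words in the English text;

and the result determination module 503 is configured to determine a prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors.

Further, the result determination module includes:

a vector sorting unit for ordering the character vectors of the characters and the word vectors of the words according to the positions of the characters and words in the Chinese-English mixed text, to generate a text vector sequence;

and a result determining unit for determining the prosody prediction result of the Chinese-English mixed text according to the text vector sequence.

Further, the word vector determination module includes:

a letter segmentation unit for segmenting words in the English text into letter sequences;

a vector determination unit for determining a letter vector for each letter in the letter sequence;

and a vector extraction unit for extracting a word vector representing the semantics of the word according to the determined letter vectors.

Further, the vector extraction unit is specifically configured to:

order the letter vectors according to the positions of the letters in the letter sequence to generate a letter vector sequence;

encode the letter vector sequence based on attention distribution probabilities of the letter vectors to generate a semantic representation;

and decode the semantic representation to obtain the word vector.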

According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.

FIG. 6 is a block diagram of an electronic device for the prosody prediction method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.

As shown in fig. 6, the electronic device includes: one or more processors 601, memory 602, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories and multiple types of memory, if desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.

The memory 602 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform a prosody prediction method provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the prosody prediction method provided by the present application.

The memory 602, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the prosody prediction method in the embodiment of the present application (for example, the text splitting module 501, the word vector determination module 502, and the result determination module 503 in the prosody prediction apparatus 500 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing, i.e., implements the prosody prediction method in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 602.

The memory 602 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created through use of the electronic device for prosody prediction, and the like. Further, the memory 602 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 may optionally include memory located remotely from the processor 601, which may be connected to the electronic device for prosody prediction over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The electronic device of the prosody prediction method may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.

The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for prosody prediction. Examples of the input device include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, and a joystick. The output device 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

It should be understood that the various forms of flows shown above may be used with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.

The above-described embodiments are not intended to limit the scope of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A prosody prediction method, comprising:

determining a prosody prediction result of the Chinese-English mixed text according to the text vector sequence;

wherein the prosody prediction result comprises a pause position, a pause type and a probability value corresponding to the pause type.

2. The method of claim 1, wherein determining a word vector for a word in an english text comprises:

segmenting words in the English text into letter sequences;

determining a letter vector for each letter in the letter sequence;

3. The method of claim 2, wherein extracting a word vector representing word semantics from the determined letter vector comprises:

and decoding the semantic representation to obtain the word vector.

4. The method of claim 1, wherein the prosody prediction result of the Chinese-English mixed text is determined by a Chinese-English mixed prosody recognition model;

5. A prosody prediction device, comprising:

the result determining module is used for ordering the character vectors of the characters and the word vectors of the words according to the positions of the characters and words in the Chinese-English mixed text to generate a text vector sequence; determining a prosody prediction result of the Chinese-English mixed text according to the text vector sequence; wherein the prosody prediction result comprises a pause position, a pause type and a probability value corresponding to the pause type.

6. The apparatus of claim 5, wherein the word vector determination module comprises:

7. The apparatus according to claim 6, wherein the vector extraction unit is specifically configured to:

and decoding the semantic representation to obtain the word vector.

8. The apparatus of claim 5, wherein the prosody prediction result of the Chinese-English mixed text is determined by a Chinese-English mixed prosody recognition model;

9. An electronic device, comprising:

at least one processor; and

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.

10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-4.