CN110797005B - Prosody prediction method, apparatus, device, and medium - Google Patents

Prosody prediction method, apparatus, device, and medium

Info

Publication number
CN110797005B
Authority
CN
China
Prior art keywords
text
chinese
english
vector
word
Prior art date
Legal status
Active
Application number
CN201911072965.0A
Other languages
Chinese (zh)
Other versions
CN110797005A (en)
Inventor
高占杰
聂志朋
卞衍尧
陈昌滨
Current Assignee
Baidu Online Network Technology Beijing Co Ltd
Original Assignee
Baidu Online Network Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Baidu Online Network Technology Beijing Co Ltd
Priority to CN201911072965.0A
Publication of CN110797005A
Application granted
Publication of CN110797005B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The embodiment of the application discloses a prosody prediction method, apparatus, device, and medium, which relate to the field of data processing, in particular to speech synthesis technology. The method comprises the following steps: segmenting a Chinese-English mixed text to be predicted to obtain a Chinese text and an English text; determining character vectors of characters in the Chinese text and word vectors of words in the English text; and determining the prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors. The prosody prediction method, apparatus, device, and medium provided by the embodiment of the application improve the accuracy of prosody prediction for Chinese-English mixed texts.

Description

Prosody prediction method, apparatus, device, and medium
Technical Field
The embodiment of the application relates to the field of data processing, in particular to speech synthesis technology. Specifically, the present embodiment provides a prosody prediction method, apparatus, device, and medium.
Background
Before speech synthesis, prosody prediction needs to be performed on the text to be synthesized.
The conventional prosody prediction method is as follows: a pre-trained machine learning prediction model predicts on the text content to be predicted and outputs a pause prediction result corresponding to the text content, wherein the pause prediction result can comprise a pause position, a pause type (which can comprise long pause, short pause and the like) and a probability value corresponding to the pause type.
The above scheme has the following defects:
The language types of the text content to be predicted are not distinguished. When the text content comprises both Chinese and English, namely the text to be predicted is a Chinese-English mixed text, English words are likely to be treated directly as a plurality of letters. However, treating English words directly as multiple letters loses the semantic information of the words, thereby reducing the accuracy of text prosody prediction.
Disclosure of Invention
The embodiment of the application provides a prosody prediction method, apparatus, device, and medium, so as to improve the accuracy of prosody prediction for Chinese-English mixed texts.
The embodiment of the application provides a prosody prediction method, which comprises the following steps:
segmenting a Chinese and English mixed text to be predicted to obtain a Chinese text and an English text;
determining character vectors of characters in the Chinese text and word vectors of words in the English text;
and determining the prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors.
In the embodiment of the application, the Chinese-English mixed text to be predicted is segmented to obtain a Chinese text and an English text, and the prosody prediction result of the Chinese-English mixed text is determined according to the character vectors of the characters in the Chinese text and the word vectors of the words in the English text, thereby realizing prosody prediction of the Chinese-English mixed text.
Because the prosody prediction result is determined according to the word vectors of the words in the English text, and a word vector retains the semantic information of its word, the accuracy of prosody prediction of the Chinese-English mixed text can be improved.
Further, the determining a prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors includes:
according to the positions of characters and words in the Chinese-English mixed text, sequencing the character vectors of the characters and the word vectors of the words to generate a text vector sequence;
and determining a prosody prediction result of the Chinese-English mixed text according to the text vector sequence.
Based on the technical characteristics, the embodiment of the application can realize the following effects: the prosody prediction result of the Chinese-English mixed text is determined according to the text vector sequence, so that the positions of the characters and words in the Chinese-English mixed text are taken into account, which further improves the accuracy of the prosody prediction result.
Further, the determining a word vector of a word in the English text includes:
segmenting words in the English text into letter sequences;
determining a letter vector for a letter in the letter sequence;
and extracting a word vector representing the word semantics according to the determined letter vector.
Based on the technical characteristics, the technical scheme of the embodiment of the application can realize the following effects: words in the English text are segmented into letter sequences, a letter vector is determined for each letter in the letter sequence, and a word vector representing the word semantics is extracted according to the determined letter vectors. In this way, word vectors can be determined for new words that are not in the dictionary, and prosody prediction can be performed for such words.
Further, the extracting a word vector representing word semantics according to the determined letter vector includes:
sorting the letter vectors according to the arrangement positions of letters in the letter sequence to generate a letter vector sequence;
encoding the letter vector sequence based on the attention distribution probabilities of the letter vectors to generate a semantic representation;
and decoding the semantic representation to obtain the word vector.
Based on the technical characteristics, the technical scheme of the embodiment of the application encodes the letter vector sequence based on the attention distribution probabilities of the letter vectors to generate the semantic representation, thereby retaining the information of the letters, avoiding the loss of detailed information, and further improving the accuracy of the determined word vector.
Further, determining a prosody prediction result of the Chinese-English mixed text through a Chinese-English mixed prosody recognition model;
the Chinese-English mixed prosody recognition model comprises a Chinese-English mixed language model and a prosody network layer; the Chinese-English mixed language model is obtained based on unsupervised learning training.
Based on the technical characteristics, the embodiment of the application can realize the following effects: by introducing a Chinese-English mixed language model obtained through unsupervised learning, the amount of annotated training data required by the Chinese-English mixed prosody recognition model can be reduced, and accuracy and recall are effectively improved for prosody annotation data of the same scale.
An embodiment of the present application further provides a prosody prediction device, including:
the text splitting module is used for splitting the Chinese and English mixed text to be predicted to obtain a Chinese text and an English text;
the word vector determining module is used for determining character vectors of characters in the Chinese text and word vectors of words in the English text;
and the result determining module is used for determining the prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors.
Further, the result determination module includes:
the vector sorting unit is used for sorting the character vectors of the characters and the word vectors of the words according to the positions of the characters and the words in the Chinese-English mixed text to generate a text vector sequence;
and the result determining unit is used for determining the prosody prediction result of the Chinese-English mixed text according to the text vector sequence.
Further, the word vector determination module includes:
the letter segmentation unit is used for segmenting words in the English text into letter sequences;
a vector determination unit for determining a letter vector of a letter in the letter sequence;
and the vector extraction unit is used for extracting a word vector representing word semantics according to the determined letter vector.
Further, the vector extraction unit is specifically configured to:
sorting the letter vectors according to the arrangement positions of letters in the letter sequence to generate a letter vector sequence;
encoding the letter vector sequence based on the attention distribution probabilities of the letter vectors to generate a semantic representation;
and decoding the semantic representation to obtain the word vector.
Further, determining a prosody prediction result of the Chinese-English mixed text through a Chinese-English mixed prosody recognition model;
The Chinese-English mixed prosody recognition model comprises a Chinese-English mixed language model and a prosody network layer; the Chinese-English mixed language model is obtained based on unsupervised learning training.
An embodiment of the present application further provides an electronic device, which includes:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments of the present application.
Embodiments of the present application also provide a non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any of the embodiments of the present application.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flow chart of a prosody prediction method according to a first embodiment of the present application;
FIG. 2 is a flowchart of a prosody prediction method according to a second embodiment of the present application;
FIG. 3 is a flowchart of a prosody prediction method according to a third embodiment of the present application;
FIG. 4 is a schematic diagram of a model structure of a prosody prediction method according to a fourth embodiment of the present application;
FIG. 5 is a schematic structural diagram of a prosody prediction device according to a fifth embodiment of the present application;
FIG. 6 is a block diagram of an electronic device of a prosody prediction method according to an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
First embodiment
Fig. 1 is a flowchart of a prosody prediction method according to a first embodiment of the present application. The embodiment can be applied to the case of accurately predicting the prosody of Chinese and English mixed texts. The method may be performed by a prosody prediction device, which may be implemented in software and/or hardware. Referring to fig. 1, the prosody prediction method provided in this embodiment includes:
S110, segmenting the Chinese and English mixed text to be predicted to obtain a Chinese text and an English text.
The Chinese-English mixed text is text that includes both Chinese and English.
The Chinese text is text that includes only Chinese.
The English text is text that includes only English.
Segmentation may yield one, two, or more Chinese texts and one, two, or more English texts.
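The segmentation in S110 can be illustrated with a minimal sketch. The patent does not specify how the language boundary is detected; the code below assumes, purely for illustration, that Chinese means the CJK Unified Ideographs range U+4E00 to U+9FFF and that English means runs of ASCII letters separated by single spaces. The function name `split_mixed_text` is hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: Chinese = CJK Unified Ideographs, English = ASCII letter runs
# (optionally joined by single spaces). The patent does not fix these ranges.
_SEGMENT = re.compile(
    r"(?P<zh>[\u4e00-\u9fff]+)|(?P<en>[A-Za-z]+(?: [A-Za-z]+)*)"
)


def split_mixed_text(text: str) -> List[Tuple[str, str, int]]:
    """Return (language, segment, start_offset) for each Chinese or English run."""
    segments = []
    for match in _SEGMENT.finditer(text):
        language = "zh" if match.group("zh") else "en"
        segments.append((language, match.group(0), match.start()))
    return segments


if __name__ == "__main__":
    for language, segment, start in split_mixed_text("我喜欢machine learning和音乐"):
        print(language, start, segment)
```

Keeping the start offset of every segment is a design choice made here so that later steps can restore the original order of characters and words.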
S120, determining character vectors of characters in the Chinese text and word vectors of words in the English text.
Specifically, the method for determining the character vectors and the word vectors is not limited here, and may be implemented according to any vector conversion method in the prior art.
S130, determining a prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors.
Specifically, the prosodic prediction result may include a pause location, a pause type (which may include long pauses, short pauses, etc.), and a probability value corresponding to the pause type.
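As a hedged illustration of what such a prediction result might look like in code, the sketch below stores a pause position, a pause type, and the probability values per pause type. The class name `PausePrediction`, the label set ("none", "short", "long"), and the rule of picking the most probable label are assumptions made for this example, not details given in the application.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PausePrediction:
    position: int                    # index of the token after which the pause falls
    pause_type: str                  # e.g. "short" or "long" (labels are assumed)
    probabilities: Dict[str, float]  # probability value per candidate pause type


def to_predictions(per_token_probs: List[Dict[str, float]]) -> List[PausePrediction]:
    """Keep the most probable pause type after each token, skipping 'none'."""
    results = []
    for index, probs in enumerate(per_token_probs):
        best = max(probs, key=probs.get)
        if best != "none":
            results.append(PausePrediction(index, best, probs))
    return results


if __name__ == "__main__":
    scores = [
        {"none": 0.80, "short": 0.15, "long": 0.05},
        {"none": 0.10, "short": 0.70, "long": 0.20},
    ]
    print(to_predictions(scores))
```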
The step of determining the prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors comprises:
according to the positions of characters and words in the Chinese-English mixed text, sequencing the character vectors of the characters and the word vectors of the words to generate a text vector sequence;
and determining a prosody prediction result of the Chinese-English mixed text according to the text vector sequence.
Based on the technical characteristics, the embodiment of the application can realize the following effects: the prosody prediction result of the Chinese-English mixed text is determined according to the text vector sequence, so that the positions of the characters and words in the Chinese-English mixed text are taken into account, which further improves the accuracy of the prosody prediction result.
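The ordering step can be sketched as a merge by position. The example assumes that each character vector and word vector is paired with the start offset of its character or word in the mixed text; this pairing, the `Vector` alias, and the function name are illustrative choices, not part of the application.

```python
from typing import List, Tuple

Vector = List[float]


def build_text_vector_sequence(
    char_items: List[Tuple[int, Vector]],
    word_items: List[Tuple[int, Vector]],
) -> List[Vector]:
    """Merge (start_offset, vector) pairs into one sequence ordered by offset."""
    merged = sorted(char_items + word_items, key=lambda item: item[0])
    return [vector for _, vector in merged]


if __name__ == "__main__":
    # Toy vectors for the text "我用Python写": characters at offsets 0, 1, 8
    # and the word "Python" at offset 2.
    chars = [(0, [0.1]), (1, [0.2]), (8, [0.3])]
    words = [(2, [0.9])]
    print(build_text_vector_sequence(chars, words))
```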
According to the technical scheme of the embodiment of the application, the Chinese-English mixed text to be predicted is segmented to obtain a Chinese text and an English text, and the prosody prediction result of the Chinese-English mixed text is determined according to the character vectors of the characters in the Chinese text and the word vectors of the words in the English text, thereby realizing prosody prediction of the Chinese-English mixed text.
Because the prosody prediction result is determined according to the word vectors of the words in the English text, and a word vector retains the semantic information of its word, the accuracy of prosody prediction of the Chinese-English mixed text can be improved.
Second embodiment
Fig. 2 is a flowchart of a prosody prediction method according to a second embodiment of the present application. The present embodiment is an alternative proposed on the basis of the above-described embodiments. Referring to fig. 2, the prosody prediction method provided in this embodiment includes:
S210, segmenting the Chinese and English mixed text to be predicted to obtain a Chinese text and an English text.
S220, determining character vectors of characters in the Chinese text and word vectors of words in the English text.
Determining a word vector of a word in an English text comprises the following steps:
segmenting words in an English text into letter sequences;
determining a letter vector of a letter in the letter sequence;
and extracting a word vector representing word semantics according to the determined letter vector.
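The first two of these steps can be sketched as a table lookup. In a trained system the letter vectors would be learned parameters; the deterministic two-dimensional values in `LETTER_TABLE` below are placeholders invented for this example, as are the lowercasing rule and the zero vector used for symbols outside the table.

```python
import string
from typing import Dict, List

Vector = List[float]

# Toy stand-in for a learned letter embedding table (values are made up).
LETTER_TABLE: Dict[str, Vector] = {
    letter: [(index + 1) / 26.0, ((index * 7) % 26) / 26.0]
    for index, letter in enumerate(string.ascii_lowercase)
}
UNKNOWN: Vector = [0.0, 0.0]  # assumption: symbols outside a-z map to zeros


def word_to_letters(word: str) -> List[str]:
    """Segment a word into its letter sequence (lowercased by assumption)."""
    return list(word.lower())


def word_to_letter_vectors(word: str) -> List[Vector]:
    """Look up one letter vector per letter, preserving letter order."""
    return [LETTER_TABLE.get(letter, UNKNOWN) for letter in word_to_letters(word)]


if __name__ == "__main__":
    print(word_to_letters("Baidu"))
    print(word_to_letter_vectors("Baidu"))
```

Because the table is indexed by letter rather than by word, a word never seen before still receives a full letter vector sequence.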
Specifically, extracting a word vector representing word semantics according to the determined letter vector comprises:
sorting the letter vectors according to the arrangement positions of letters in the letter sequence to generate a letter vector sequence;
encoding the letter vector sequence to generate semantic representation;
and decoding the semantic representation to obtain the word vector.
To improve the accuracy of the determined word vector, encoding the letter vector sequence to generate the semantic representation comprises:
encoding the letter vector sequence based on the attention distribution probabilities of the letter vectors to generate the semantic representation.
The attention distribution probabilities of the letter vectors may be determined based on an attention mechanism.
Based on the technical characteristics, the technical scheme of the embodiment of the application encodes the letter vector sequence based on the attention distribution probabilities of the letter vectors to generate the semantic representation, thereby retaining the information of the letters, avoiding the loss of detailed information, and further improving the accuracy of the determined word vector.
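One possible reading of this encode-then-decode step is sketched below. It assumes dot-product scoring against a single query vector, a softmax to obtain the attention distribution, a weighted sum as the semantic representation, and a linear projection as the decoder. None of these specifics are stated in the application; `query` and `projection` stand in for parameters that would be learned.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def encode_with_attention(
    letter_vectors: List[Vector], query: Vector
) -> Tuple[Vector, List[float]]:
    """Weight each letter vector by its attention probability and sum them."""
    weights = softmax([dot(vector, query) for vector in letter_vectors])
    dim = len(letter_vectors[0])
    encoded = [
        sum(weight * vector[d] for weight, vector in zip(weights, letter_vectors))
        for d in range(dim)
    ]
    return encoded, weights


def decode(encoded: Vector, projection: List[Vector]) -> Vector:
    """Map the semantic representation to a word vector (assumed linear layer)."""
    return [dot(row, encoded) for row in projection]


if __name__ == "__main__":
    letters = [[1.0, 0.0], [0.0, 1.0]]
    encoded, weights = encode_with_attention(letters, query=[0.0, 0.0])
    print(weights, encoded)
    print(decode(encoded, [[1.0, 1.0], [1.0, -1.0], [2.0, 0.0]]))
```

Every letter contributes to the weighted sum, which is one way to keep letter-level detail in the resulting word vector.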
S230, determining a prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors.
According to the technical scheme of the embodiment of the application, words in the English text are segmented into letter sequences, a letter vector is determined for each letter in the letter sequence, and a word vector representing the word semantics is extracted according to the determined letter vectors. In this way, word vectors can be determined for new words that are not in the dictionary, and prosody prediction can be performed for such words.
Third embodiment
Fig. 3 is a flowchart of a prosody prediction method according to a third embodiment of the present application. The present embodiment is an alternative proposed on the basis of the above-described embodiments. Referring to fig. 3, the prosody prediction method provided in this embodiment includes:
S310, segmenting the Chinese-English mixed text to be predicted to obtain a Chinese text and an English text.
S320, determining character vectors of characters in the Chinese text and word vectors of words in the English text.
S330, inputting the determined character vectors and word vectors into a Chinese-English mixed prosody recognition model, and outputting a prosody prediction result of the Chinese-English mixed text.
The Chinese-English mixed prosody recognition model is a model for performing prosody prediction on Chinese-English mixed texts. The model is trained in advance on labeled samples by supervised learning.
The Chinese-English mixed prosody recognition model comprises a Chinese-English mixed language model and a prosody network layer; the Chinese-English mixed language model is obtained based on unsupervised learning training.
The Chinese-English mixed language model is used for extracting character vectors of the Chinese text and word vectors of the English text, and determining the semantic relation between the characters or words in the Chinese-English mixed text according to the extracted character vectors and word vectors.
The prosody network layer is used for converting the determined semantic relation between the characters or words in the Chinese-English mixed text into a prosody prediction result of the Chinese-English mixed text.
According to this technical scheme, a Chinese-English mixed language model obtained through unsupervised learning is introduced, so that the amount of annotated training data required by the Chinese-English mixed prosody recognition model can be reduced, and accuracy and recall are effectively improved for prosody annotation data of the same scale.
Fourth embodiment
Fig. 4 is a schematic diagram of a model structure of a prosody prediction method according to a fourth embodiment of the present application. The present embodiment is an alternative proposed on the basis of the above-described embodiments. Referring to fig. 4, the prosody prediction method provided in this embodiment includes:
The Chinese-English mixed text to be predicted is segmented to obtain the Chinese text and the English text.
Each word in the English text is segmented into a letter sequence, and the Chinese text is segmented into a character sequence.
The mixed sequence of characters and letters is input into a Chinese-English mixed prosody model, which outputs a prosody result.
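A minimal sketch of how such a mixed sequence could be built is shown below. It assumes that Chinese characters fall in the range U+4E00 to U+9FFF, that English words are runs of ASCII letters, and that everything else (digits, punctuation, whitespace) is skipped; these rules and the function name `to_mixed_sequence` are illustrative assumptions.

```python
import re
from typing import List, Union

Token = Union[str, List[str]]  # a Chinese character, or the letter list of a word

# Assumption: one token per CJK ideograph or per run of ASCII letters.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+")


def to_mixed_sequence(text: str) -> List[Token]:
    """Keep each Chinese character as one token; split each English word into letters."""
    sequence: List[Token] = []
    for match in _TOKEN.finditer(text):
        piece = match.group(0)
        sequence.append(list(piece) if piece.isascii() else piece)
    return sequence


if __name__ == "__main__":
    print(to_mixed_sequence("听Jazz吧"))
```

Grouping the letters of each word in their own list keeps word boundaries visible, so a later step can turn each letter list into a single word vector.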
The Chinese-English mixed prosody model comprises a Chinese-English mixed language model and a multi-layer fully-connected layer serving as a prosody network layer.
First, the Chinese-English mixed language model is trained by unsupervised learning on a large amount of pure Chinese, pure English, and Chinese-English mixed data. After the Chinese-English mixed language model is obtained, multiple fully-connected layers are stacked on top of it. Then, supervised learning is performed on the whole model based on the Chinese-English mixed prosody annotation data to obtain the complete Chinese-English mixed prosody model.
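The fully-connected layers stacked on the language model can be sketched as a small forward pass. The application only states that multiple fully-connected layers serve as the prosody network layer; the ReLU between layers, the final softmax over pause types, and the hand-written weights below are assumptions for illustration, and `hidden` stands in for one output vector of the language model.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """One fully-connected layer: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, value) for value in x]


def softmax(x: Vector) -> Vector:
    peak = max(x)
    exps = [math.exp(value - peak) for value in x]
    total = sum(exps)
    return [value / total for value in exps]


def prosody_head(hidden: Vector, layers: List[Tuple[Matrix, Vector]]) -> Vector:
    """Apply the stacked layers; ReLU between layers, softmax at the end (assumed)."""
    out = hidden
    for index, (weights, bias) in enumerate(layers):
        out = linear(out, weights, bias)
        if index < len(layers) - 1:
            out = relu(out)
    return softmax(out)


if __name__ == "__main__":
    layers = [
        ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
        ([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0]),
    ]
    print(prosody_head([2.0, -1.0], layers))  # probabilities over 3 pause types
```

In the two-stage scheme described above, only this head starts from scratch during supervised training, while the language model underneath starts from its pre-trained parameters.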
When the Chinese-English mixed language model is trained, the input Chinese-English mixed text is first segmented into characters and words, and each word is further converted into a letter sequence. The Chinese characters can then be indexed directly to character vectors through the dictionary. The letter sequence of a word is indexed to a letter vector sequence through the letter dictionary, and the word vector of the word is learned from the letter vector sequence through the fully-connected layer and the attention mechanism.
The entire Chinese-English mixed sequence is thereby converted into a vector sequence composed of the character vectors of the characters and the word vectors of the words. This vector sequence is passed through multiple conversion network layers to obtain the Chinese-English mixed language model.
According to the technical scheme of the embodiment of the application, prosody prediction of the Chinese-English mixed text can be realized. The introduction of the Chinese-English mixed language model effectively improves accuracy and recall for prosody annotation data of the same scale. Namely, based on the pre-trained Chinese-English mixed language model, a Chinese-English mixed prosody model with good precision and recall can be trained with only a small amount of labeled data, which addresses the scarcity of prosody annotation data.
In addition, the Chinese-English mixed language model performs letter-level modeling of English words, which effectively improves the prediction accuracy for words that are not in the dictionary during Chinese-English mixed prosody prediction.
Fifth embodiment
Fig. 5 is a schematic structural diagram of a prosody prediction device according to a fifth embodiment of the present application. Referring to fig. 5, the prosody prediction device 500 provided in the present embodiment includes: a text splitting module 501, a word vector determination module 502, and a result determination module 503.
The text splitting module 501 is configured to split a Chinese-English mixed text to be predicted to obtain a Chinese text and an English text;
the word vector determination module 502 is configured to determine character vectors of characters in the Chinese text and word vectors of words in the English text;
and the result determination module 503 is configured to determine a prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors.
According to the technical scheme of the embodiment of the application, the Chinese-English mixed text to be predicted is segmented to obtain a Chinese text and an English text, and the prosody prediction result of the Chinese-English mixed text is determined according to the character vectors of the characters in the Chinese text and the word vectors of the words in the English text, thereby realizing prosody prediction of the Chinese-English mixed text.
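When the device is implemented in software, the three modules could be wired together roughly as follows. This is only a structural sketch: the class name and the toy callables passed in at the bottom are hypothetical stand-ins, and each module's real logic is left open by the application.

```python
from typing import Any, Callable, List


class ProsodyPredictionDevice:
    """Chains the three modules of the device (numbered 501 to 503 in FIG. 5)."""

    def __init__(
        self,
        text_splitting_module: Callable[[str], List[Any]],
        word_vector_determination_module: Callable[[List[Any]], List[Any]],
        result_determination_module: Callable[[List[Any]], Any],
    ) -> None:
        self.text_splitting_module = text_splitting_module                        # 501
        self.word_vector_determination_module = word_vector_determination_module  # 502
        self.result_determination_module = result_determination_module            # 503

    def predict(self, mixed_text: str) -> Any:
        segments = self.text_splitting_module(mixed_text)
        vectors = self.word_vector_determination_module(segments)
        return self.result_determination_module(vectors)


if __name__ == "__main__":
    # Toy stand-ins: split on "|", use segment length as a 1-d vector,
    # and predict a short pause after long segments.
    device = ProsodyPredictionDevice(
        lambda text: text.split("|"),
        lambda segments: [[float(len(segment))] for segment in segments],
        lambda vectors: ["short" if vector[0] > 2 else "none" for vector in vectors],
    )
    print(device.predict("你好|hello"))
```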
Further, the result determination module includes:
the vector sorting unit is used for sorting the character vectors of the characters and the word vectors of the words according to the positions of the characters and the words in the Chinese-English mixed text to generate a text vector sequence;
and the result determining unit is used for determining the prosody prediction result of the Chinese-English mixed text according to the text vector sequence.
Further, the word vector determination module includes:
the letter segmentation unit is used for segmenting words in the English text into letter sequences;
a vector determination unit for determining a letter vector of a letter in the letter sequence;
and the vector extraction unit is used for extracting a word vector representing the word semantics according to the determined letter vector.
Further, the vector extraction unit is specifically configured to:
sorting the letter vectors according to the arrangement positions of letters in the letter sequence to generate a letter vector sequence;
encoding the letter vector sequence based on the attention distribution probabilities of the letter vectors to generate a semantic representation;
and decoding the semantic representation to obtain the word vector.
Further, determining a prosody prediction result of the Chinese-English mixed text through a Chinese-English mixed prosody recognition model;
the Chinese-English mixed prosody recognition model comprises a Chinese-English mixed language model and a prosody network layer; the Chinese-English mixed language model is obtained based on unsupervised learning training.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
FIG. 6 is a block diagram of an electronic device for the prosody prediction method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in FIG. 6, the electronic device includes: one or more processors 601, memory 602, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, if desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In FIG. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform a prosody prediction method provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the prosody prediction method provided by the present application.
The memory 602, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the prosody prediction method in the embodiment of the present application (for example, the text splitting module 501, the word vector determination module 502, and the result determination module 503 in the prosody prediction apparatus 500 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing, i.e., implements the prosody prediction method in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 602.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function; the storage data area may store data created according to use of the electronic device for prosody prediction, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 may optionally include memory located remotely from the processor 601, which may be connected to the electronic device for prosody prediction over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the prosody prediction method may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for prosody prediction. Examples of the input device include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input devices. The output device 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that the flows shown above may be used in various forms, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved, which is not limited herein.
The above-described embodiments are not intended to limit the scope of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A prosody prediction method, comprising:
segmenting a Chinese and English mixed text to be predicted to obtain a Chinese text and an English text;
determining character vectors of characters in the Chinese text and word vectors of words in the English text;
according to the positions of characters and words in the Chinese-English mixed text, sequencing the character vectors of the characters and the word vectors of the words to generate a text vector sequence;
determining a prosody prediction result of the Chinese-English mixed text according to the text vector sequence;
wherein the prosody prediction result comprises a pause position, a pause type and a probability value corresponding to the pause type.
2. The method of claim 1, wherein determining a word vector for a word in the English text comprises:
dividing words in the English text into letter sequences;
determining a letter vector of a letter in the letter sequence;
and extracting a word vector representing word semantics according to the determined letter vector.
3. The method of claim 2, wherein extracting a word vector representing word semantics from the determined letter vector comprises:
sorting the letter vectors according to the arrangement positions of letters in the letter sequence to generate a letter vector sequence;
coding the letter vector sequence based on the letter vector attention distribution probability to generate semantic representation;
and decoding the semantic representation to obtain the word vector.
4. The method of claim 1, wherein the prosody prediction result of the Chinese-English mixed text is determined by a Chinese-English mixed prosody recognition model;
the Chinese-English mixed prosody recognition model comprises a Chinese-English mixed language model and a prosody network layer; the Chinese-English mixed language model is obtained based on unsupervised learning training.
5. A prosody prediction device, comprising:
the text splitting module is used for splitting the Chinese and English mixed text to be predicted to obtain a Chinese text and an English text;
the word vector determining module is used for determining character vectors of characters in the Chinese text and word vectors of words in the English text;
the result determining module is used for sequencing the character vectors of the characters and the word vectors of the words according to the positions of the characters and the words in the Chinese-English mixed text to generate a text vector sequence; determining a prosody prediction result of the Chinese-English mixed text according to the text vector sequence; wherein the prosody prediction result comprises a pause position, a pause type and a probability value corresponding to the pause type.
6. The apparatus of claim 5, wherein the word vector determination module comprises:
the letter segmentation unit is used for segmenting words in the English text into letter sequences;
a vector determination unit for determining a letter vector of a letter in the letter sequence;
and the vector extraction unit is used for extracting a word vector representing word semantics according to the determined letter vector.
7. The apparatus according to claim 6, wherein the vector extraction unit is specifically configured to:
sorting the letter vectors according to the arrangement positions of letters in the letter sequence to generate a letter vector sequence;
coding the letter vector sequence based on the letter vector attention distribution probability to generate semantic representation;
and decoding the semantic representation to obtain the word vector.
8. The apparatus of claim 5, wherein the prosody prediction result of the Chinese-English mixed text is determined by a Chinese-English mixed prosody recognition model;
the Chinese-English mixed prosody recognition model comprises a Chinese-English mixed language model and a prosody network layer; the Chinese-English mixed language model is obtained based on unsupervised learning training.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-4.
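
The letter-level steps recited in claims 2 and 3 (and mirrored in claims 6 and 7) can be pictured with a small sketch: letter vectors are kept in written order, an attention distribution over them is computed, the weighted sum serves as the semantic representation, and a projection decodes it into a word vector. Everything concrete below is an assumption for illustration, including the dimension `DIM`, the sinusoidal `letter_vector`, the dot-product scoring against a `query`, and the linear `decode`; the claims do not fix any of these choices, and a trained model would learn the corresponding parameters.

```python
import math
from typing import List

Vector = List[float]
DIM = 4  # hypothetical letter-vector size


def letter_vector(letter: str, position: int) -> Vector:
    """Toy letter embedding plus a small position term (stand-in for learned tables)."""
    idx = ord(letter.lower()) - ord("a")
    return [
        math.sin((idx + 1) * (k + 1)) + 0.1 * math.cos(position + k)
        for k in range(DIM)
    ]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_distribution(letters: List[Vector], query: Vector) -> List[float]:
    """Scaled dot-product score of each letter vector against the query, then softmax."""
    scores = [
        sum(a * b for a, b in zip(vec, query)) / math.sqrt(DIM) for vec in letters
    ]
    return softmax(scores)


def encode(letters: List[Vector], query: Vector) -> Vector:
    """Semantic representation: attention-weighted sum of the letter vectors."""
    weights = attention_distribution(letters, query)
    return [sum(w * vec[k] for w, vec in zip(weights, letters)) for k in range(DIM)]


def decode(semantic: Vector, projection: List[Vector]) -> Vector:
    """Linear projection of the semantic representation into the word-vector space."""
    return [sum(w * x for w, x in zip(row, semantic)) for row in projection]


def word_vector(word: str, query: Vector, projection: List[Vector]) -> Vector:
    """Letter sequence -> letter vector sequence -> encode -> decode."""
    if not word:
        raise ValueError("word must contain at least one letter")
    letters = [letter_vector(ch, pos) for pos, ch in enumerate(word)]
    return decode(encode(letters, query), projection)


if __name__ == "__main__":
    query = [1.0, 0.0, 0.0, 0.0]
    projection = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    print(word_vector("hello", query, projection))
```

The position term is included only so that the letter order affects the result; a plain attention-weighted sum without it would give the same vector for any permutation of the same letters.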
CN201911072965.0A 2019-11-05 2019-11-05 Prosody prediction method, apparatus, device, and medium Active CN110797005B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911072965.0A CN110797005B (en) 2019-11-05 2019-11-05 Prosody prediction method, apparatus, device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911072965.0A CN110797005B (en) 2019-11-05 2019-11-05 Prosody prediction method, apparatus, device, and medium

Publications (2)

Publication Number Publication Date
CN110797005A CN110797005A (en) 2020-02-14
CN110797005B true CN110797005B (en) 2022-06-10

Family

ID=69442782

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911072965.0A Active CN110797005B (en) 2019-11-05 2019-11-05 Prosody prediction method, apparatus, device, and medium

Country Status (1)

Country Link
CN (1) CN110797005B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111724765B (en) * 2020-06-30 2023-07-25 度小满科技(北京)有限公司 Text-to-speech method and device and computer equipment
CN112216267A (en) * 2020-09-15 2021-01-12 北京捷通华声科技股份有限公司 Rhythm prediction method, device, equipment and storage medium
CN112131878B (en) * 2020-09-29 2022-05-31 腾讯科技(深圳)有限公司 Text processing method and device and computer equipment
CN112289305A (en) * 2020-11-23 2021-01-29 北京有竹居网络技术有限公司 Prosody prediction method, device, equipment and storage medium
CN112397050B (en) * 2020-11-25 2023-07-07 北京百度网讯科技有限公司 Prosody prediction method, training device, electronic equipment and medium
CN113327579A (en) * 2021-08-03 2021-08-31 北京世纪好未来教育科技有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN113836305B (en) * 2021-09-29 2024-03-22 有米科技股份有限公司 Text-based industry category identification method and device
CN116665643B (en) * 2022-11-30 2024-03-26 荣耀终端有限公司 Rhythm marking method and device and terminal equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1731510A (en) * 2004-08-05 2006-02-08 摩托罗拉公司 Text-speech conversion for amalgamated language
CN105118499A (en) * 2015-07-06 2015-12-02 百度在线网络技术(北京)有限公司 Rhythmic pause prediction method and apparatus
CN105989833A (en) * 2015-02-28 2016-10-05 讯飞智元信息科技有限公司 Multilingual mixed-language text character-pronunciation conversion method and system
CN106601228A (en) * 2016-12-09 2017-04-26 百度在线网络技术(北京)有限公司 Sample marking method and device based on artificial intelligence prosody prediction
CN107039034A (en) * 2016-02-04 2017-08-11 科大讯飞股份有限公司 A kind of prosody prediction method and system
CN108305612A (en) * 2017-11-21 2018-07-20 腾讯科技(深圳)有限公司 Text-processing, model training method, device, storage medium and computer equipment
CN109697973A (en) * 2019-01-22 2019-04-30 清华大学深圳研究生院 A kind of method, the method and device of model training of prosody hierarchy mark
CN110298035A (en) * 2019-06-04 2019-10-01 平安科技(深圳)有限公司 Word vector based on artificial intelligence defines method, apparatus, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101000764B (en) * 2006-12-18 2011-05-18 黑龙江大学 Speech synthetic text processing method based on rhythm structure
CN110176225B (en) * 2019-05-30 2021-08-13 科大讯飞股份有限公司 Method and device for evaluating rhythm prediction effect

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
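An end-to-end Chinese speech synthesis scheme based on Tacotron2; Wang Guoliang et al.; Journal of Northeast Normal University (Natural Science Edition); 20190731; full text *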

Also Published As

Publication number Publication date
CN110797005A (en) 2020-02-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant