CN110797005B - Prosody prediction method, apparatus, device, and medium - Google Patents
- Publication number
- CN110797005B (application CN201911072965.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- chinese
- english
- vector
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The embodiments of the present application disclose a prosody prediction method, apparatus, device, and medium, which relate to the field of data processing and in particular to speech synthesis technology. The method comprises the following steps: segmenting a Chinese-English mixed text to be predicted to obtain a Chinese text and an English text; determining character vectors of the characters in the Chinese text and word vectors of the words in the English text; and determining a prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors. The embodiments of the present application improve the prosody prediction accuracy for Chinese-English mixed text.
Description
Technical Field
The embodiments of the present application relate to the field of data processing, and in particular to speech synthesis technology. Specifically, the embodiments provide a prosody prediction method, apparatus, device, and medium.
Background
Before speech is synthesized, prosody prediction must be performed on the text to be spoken.
A conventional prosody prediction method is as follows: using a machine learning method, the text content to be predicted is fed to a pre-trained prediction model to obtain a pause prediction result for the text content. The pause prediction result may include a pause position, a pause type (such as a long pause or a short pause), and a probability value corresponding to the pause type.
The above scheme has the following defect:
The language of the text content to be predicted is not distinguished. When the text content includes both Chinese and English, that is, when the text to be predicted is a Chinese-English mixed text, English words are likely to be treated directly as a plurality of letters. Treating an English word as a plurality of letters loses the semantic information of the word, which reduces the accuracy of prosody prediction for the text.
Disclosure of Invention
The embodiments of the present application provide a prosody prediction method, apparatus, device, and medium, so as to improve the prosody prediction accuracy for Chinese-English mixed text.
An embodiment of the present application provides a prosody prediction method, which comprises the following steps:
segmenting a Chinese-English mixed text to be predicted to obtain a Chinese text and an English text;
determining character vectors of the characters in the Chinese text and word vectors of the words in the English text;
and determining a prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors.
In the embodiments of the present application, the Chinese-English mixed text to be predicted is segmented to obtain a Chinese text and an English text, and the prosody prediction result of the Chinese-English mixed text is determined according to the character vectors of the Chinese characters and the word vectors of the words in the English text, thereby realizing prosody prediction for Chinese-English mixed text.
Because the prosody prediction result for the English portion of the Chinese-English mixed text is determined from the word vectors of the words in the English text, and a word vector retains the semantic information of its word, the accuracy of prosody prediction for the English portion of the Chinese-English mixed text can be improved.
Further, determining the prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors includes:
sorting the character vectors of the characters and the word vectors of the words according to the positions of the characters and words in the Chinese-English mixed text to generate a text vector sequence;
and determining the prosody prediction result of the Chinese-English mixed text according to the text vector sequence.
Based on these technical features, the embodiments of the present application can achieve the following effect: because the prosody prediction result is determined from the text vector sequence, the positions of the characters and words in the Chinese-English mixed text are taken into account, which further improves the accuracy of the prosody prediction result.
Further, determining the word vector of a word in the English text includes:
segmenting the word in the English text into a letter sequence;
determining a letter vector for each letter in the letter sequence;
and extracting a word vector representing the semantics of the word according to the determined letter vectors.
Based on these technical features, the embodiments of the present application can achieve the following effect: by segmenting words in the English text into letter sequences, determining the letter vectors of the letters in each letter sequence, and extracting word vectors representing word semantics from the determined letter vectors, word vectors can be determined for new words that are not in the dictionary, so that prosody prediction can also be performed for such words.
Further, extracting the word vector representing word semantics according to the determined letter vectors includes:
sorting the letter vectors according to the positions of the letters in the letter sequence to generate a letter vector sequence;
encoding the letter vector sequence based on letter vector attention distribution probabilities to generate a semantic representation;
and decoding the semantic representation to obtain the word vector.
Based on these technical features, the embodiments of the present application encode the letter vector sequence into a semantic representation based on letter vector attention distribution probabilities, which retains the information carried by the letters, avoids the loss of detail, and thereby improves the accuracy with which the word vector is determined.
Further, the prosody prediction result of the Chinese-English mixed text is determined through a Chinese-English mixed prosody recognition model;
the Chinese-English mixed prosody recognition model comprises a Chinese-English mixed language model and a prosody network layer; the Chinese-English mixed language model is obtained through unsupervised training.
Based on these technical features, the embodiments of the present application can achieve the following effect: because the Chinese-English mixed language model is obtained through unsupervised learning, the amount of annotated training data required by the Chinese-English mixed prosody recognition model can be reduced, and accuracy and recall are effectively improved for prosody annotation data of the same scale.
An embodiment of the present application further provides a prosody prediction apparatus, including:
a text splitting module, configured to segment the Chinese-English mixed text to be predicted to obtain a Chinese text and an English text;
a word vector determination module, configured to determine character vectors of the characters in the Chinese text and word vectors of the words in the English text;
and a result determination module, configured to determine the prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors.
Further, the result determination module includes:
a vector sorting unit, configured to sort the character vectors of the characters and the word vectors of the words according to the positions of the characters and words in the Chinese-English mixed text to generate a text vector sequence;
and a result determination unit, configured to determine the prosody prediction result of the Chinese-English mixed text according to the text vector sequence.
Further, the word vector determination module includes:
a letter segmentation unit, configured to segment words in the English text into letter sequences;
a vector determination unit, configured to determine the letter vector of each letter in the letter sequence;
and a vector extraction unit, configured to extract a word vector representing word semantics according to the determined letter vectors.
Further, the vector extraction unit is specifically configured to:
sort the letter vectors according to the positions of the letters in the letter sequence to generate a letter vector sequence;
encode the letter vector sequence based on letter vector attention distribution probabilities to generate a semantic representation;
and decode the semantic representation to obtain the word vector.
Further, the prosody prediction result of the Chinese-English mixed text is determined through a Chinese-English mixed prosody recognition model;
the Chinese-English mixed prosody recognition model comprises a Chinese-English mixed language model and a prosody network layer; the Chinese-English mixed language model is obtained through unsupervised training.
An embodiment of the present application further provides an electronic device, which includes:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments of the present application.
Embodiments of the present application also provide a non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of the embodiments of the present application.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flow chart of a prosody prediction method according to a first embodiment of the present application;
FIG. 2 is a flowchart of a prosody prediction method according to a second embodiment of the present application;
FIG. 3 is a flowchart of a prosody prediction method according to a third embodiment of the present application;
FIG. 4 is a schematic diagram of a model structure of a prosody prediction method according to a fourth embodiment of the present application;
FIG. 5 is a schematic structural diagram of a prosody prediction apparatus according to a fifth embodiment of the present application;
FIG. 6 is a block diagram of an electronic device for the prosody prediction method according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
First embodiment
Fig. 1 is a flowchart of a prosody prediction method according to a first embodiment of the present application. The embodiment can be applied to the case of accurately predicting the prosody of Chinese and English mixed texts. The method may be performed by a prosody prediction device, which may be implemented in software and/or hardware. Referring to fig. 1, the prosody prediction method provided in this embodiment includes:
S110, segmenting the Chinese-English mixed text to be predicted to obtain a Chinese text and an English text.
The Chinese-English mixed text is text that includes both Chinese and English.
A Chinese text is text that includes only Chinese.
An English text is text that includes only English.
The segmentation may yield one, two, or more Chinese texts and English texts.
S120, determining character vectors of the characters in the Chinese text and word vectors of the words in the English text.
Specifically, the method for determining the character vectors and the word vectors is not limited here, and any vector conversion method in the prior art may be used.
S130, determining a prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors.
Specifically, the prosody prediction result may include a pause position, a pause type (such as a long pause or a short pause), and a probability value corresponding to the pause type.
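The three components of the prosody prediction result (pause position, pause type, probability) can be pictured as a small record. The sketch below is illustrative only: the class name PausePrediction, its field names, and the helper best_pause_per_position are hypothetical and do not appear in the application.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PausePrediction:
    position: int        # index of the token after which the pause occurs
    pause_type: str      # e.g. "short" or "long"
    probability: float   # probability value for this pause type

def best_pause_per_position(candidates):
    """Keep, for each position, the candidate with the highest probability."""
    best = {}
    for candidate in candidates:
        current = best.get(candidate.position)
        if current is None or candidate.probability > current.probability:
            best[candidate.position] = candidate
    return [best[position] for position in sorted(best)]

if __name__ == "__main__":
    candidates = [
        PausePrediction(2, "short", 0.7),
        PausePrediction(2, "long", 0.2),
        PausePrediction(5, "long", 0.6),
    ]
    for prediction in best_pause_per_position(candidates):
        print(prediction)
```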
Determining the prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors comprises the following steps:
sorting the character vectors of the characters and the word vectors of the words according to the positions of the characters and words in the Chinese-English mixed text to generate a text vector sequence;
and determining the prosody prediction result of the Chinese-English mixed text according to the text vector sequence.
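The ordering step can be sketched as a merge of two position-tagged lists. This is a minimal sketch under stated assumptions: each item is a hypothetical (position, vector) pair, where position is the index of the character or word in the mixed text, and the function name build_text_vector_sequence is introduced only for this example.

```python
def build_text_vector_sequence(char_items, word_items):
    """Merge character vectors and word vectors into one position-ordered list.

    Each item is a (position, vector) pair; positions are assumed to be unique
    indices of the character or word within the mixed text.
    """
    merged = sorted(char_items + word_items, key=lambda item: item[0])
    return [vector for _, vector in merged]

if __name__ == "__main__":
    # Toy two-dimensional vectors for "我 like 了" (positions 0, 1, 2).
    char_items = [(0, [0.1, 0.2]), (2, [0.3, 0.4])]
    word_items = [(1, [0.5, 0.6])]
    print(build_text_vector_sequence(char_items, word_items))
```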
Based on these technical features, the embodiments of the present application can achieve the following effect: because the prosody prediction result is determined from the text vector sequence, the positions of the characters and words in the Chinese-English mixed text are taken into account, which further improves the accuracy of the prosody prediction result.
According to the technical solution of this embodiment, the Chinese-English mixed text to be predicted is segmented to obtain a Chinese text and an English text, and the prosody prediction result of the Chinese-English mixed text is determined according to the character vectors of the Chinese characters and the word vectors of the words in the English text, thereby realizing prosody prediction for Chinese-English mixed text.
Because the prosody prediction result for the English portion of the Chinese-English mixed text is determined from the word vectors of the words in the English text, and a word vector retains the semantic information of its word, the accuracy of prosody prediction for the English portion of the Chinese-English mixed text can be improved.
Second embodiment
Fig. 2 is a flowchart of a prosody prediction method according to a second embodiment of the present application. The present embodiment is an alternative proposed on the basis of the above-described embodiments. Referring to fig. 2, the prosody prediction method provided in this embodiment includes:
S210, segmenting the Chinese and English mixed text to be predicted to obtain a Chinese text and an English text.
S220, determining character vectors of the characters in the Chinese text and word vectors of the words in the English text.
Determining the word vector of a word in the English text comprises the following steps:
segmenting the word in the English text into a letter sequence;
determining a letter vector for each letter in the letter sequence;
and extracting a word vector representing word semantics according to the determined letter vectors.
Specifically, extracting the word vector representing word semantics according to the determined letter vectors comprises:
sorting the letter vectors according to the positions of the letters in the letter sequence to generate a letter vector sequence;
encoding the letter vector sequence to generate a semantic representation;
and decoding the semantic representation to obtain the word vector.
In order to improve the accuracy with which the word vector is determined, encoding the letter vector sequence to generate the semantic representation comprises:
encoding the letter vector sequence based on letter vector attention distribution probabilities to generate the semantic representation.
The letter vector attention distribution probabilities may be determined based on an attention mechanism.
Based on these technical features, the technical solution of this embodiment encodes the letter vector sequence into a semantic representation based on letter vector attention distribution probabilities, which retains the information carried by the letters, avoids the loss of detail, and thereby improves the accuracy with which the word vector is determined.
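The encode-then-decode procedure can be sketched in plain Python. This is a simplified illustration and not the model described in the application: the attention distribution is computed as a softmax over dot products with a hypothetical query vector, the "decoder" is a single hypothetical linear map, and all names (letter_table, query, matrix) are assumptions. A real encoder would also make use of letter order, which this pooling step ignores.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def encode_letters(letter_vectors, query):
    """Attention-weighted sum of letter vectors (the semantic representation)."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in letter_vectors]
    weights = softmax(scores)  # attention distribution over the letters
    dim = len(letter_vectors[0])
    representation = [
        sum(w * vec[d] for w, vec in zip(weights, letter_vectors))
        for d in range(dim)
    ]
    return representation, weights

def decode(representation, matrix):
    """Hypothetical decoder: one linear map from representation to word vector."""
    return [sum(m * r for m, r in zip(row, representation)) for row in matrix]

def word_vector(word, letter_table, query, matrix):
    """Letters -> letter vectors (in order) -> representation -> word vector."""
    letter_vectors = [letter_table[letter] for letter in word.lower()]
    representation, _ = encode_letters(letter_vectors, query)
    return decode(representation, matrix)

if __name__ == "__main__":
    table = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(word_vector("ab", table, [0.0, 0.0], identity))
```

Because the word vector is assembled from letter vectors, a word that is absent from any word dictionary still receives a vector, provided its letters are in the letter table.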
S230, determining a prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors.
According to the technical solution of this embodiment, words in the English text are segmented into letter sequences, the letter vectors of the letters in each letter sequence are determined, and word vectors representing word semantics are extracted from the determined letter vectors. Word vectors can therefore be determined for new words that are not in the dictionary, so that prosody prediction can also be performed for such words.
Third embodiment
Fig. 3 is a flowchart of a prosody prediction method according to a third embodiment of the present application. The present embodiment is an alternative proposed on the basis of the above-described embodiments. Referring to fig. 3, the prosody prediction method provided in this embodiment includes:
S310, segmenting the Chinese-English mixed text to be predicted to obtain a Chinese text and an English text.
S320, determining character vectors of the characters in the Chinese text and word vectors of the words in the English text.
S330, inputting the determined character vectors and word vectors into a Chinese-English mixed prosody recognition model, and outputting a prosody prediction result of the Chinese-English mixed text.
The Chinese-English mixed prosody recognition model is a model for performing prosody prediction on Chinese-English mixed text. The model is trained in advance on labeled samples through supervised learning.
The Chinese-English mixed prosody recognition model comprises a Chinese-English mixed language model and a prosody network layer; the Chinese-English mixed language model is obtained through unsupervised training.
The Chinese-English mixed language model is used to extract the character vectors of the Chinese text and the word vectors of the English text, and to determine the semantic relations between the characters or words in the Chinese-English mixed text according to the extracted character vectors and word vectors.
The prosody network layer is used to convert the determined semantic relations between the characters or words in the Chinese-English mixed text into the prosody prediction result of the Chinese-English mixed text.
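The division of labor between the two components can be sketched as follows. This is a toy illustration under loud assumptions: toy_language_model is a stand-in that merely averages each vector with its neighbours (the application does not specify the architecture at this point), prosody_layer is a single linear layer with softmax, and the label set PAUSE_TYPES and all weights are hypothetical.

```python
import math

PAUSE_TYPES = ("none", "short", "long")  # hypothetical label set

def toy_language_model(vectors):
    """Stand-in contextual encoder: average each vector with its neighbours."""
    contextual = []
    for index in range(len(vectors)):
        window = vectors[max(0, index - 1): index + 2]
        dim = len(vectors[index])
        contextual.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return contextual

def prosody_layer(hidden, weights, bias):
    """Linear layer plus softmax: one probability per pause type."""
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def predict_prosody(vectors, weights, bias):
    """Return (pause_type, probability) for each token in the vector sequence."""
    results = []
    for hidden in toy_language_model(vectors):
        probabilities = prosody_layer(hidden, weights, bias)
        best = max(range(len(probabilities)), key=probabilities.__getitem__)
        results.append((PAUSE_TYPES[best], probabilities[best]))
    return results

if __name__ == "__main__":
    weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    bias = [0.0, 0.0, 0.0]
    print(predict_prosody([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], weights, bias))
```

In the arrangement described above, only the prosody layer needs labeled prosody data from scratch, because the language model can be trained beforehand on unlabeled text.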
According to this technical solution, because the Chinese-English mixed language model is obtained through unsupervised learning, the amount of annotated training data required by the Chinese-English mixed prosody recognition model can be reduced, and accuracy and recall are effectively improved for prosody annotation data of the same scale.
Fourth embodiment
Fig. 4 is a schematic diagram of a model structure of a prosody prediction method according to a fourth embodiment of the present application. The present embodiment is an alternative proposed on the basis of the above-described embodiments. Referring to fig. 4, the prosody prediction method provided in this embodiment includes:
The Chinese-English mixed text to be predicted is segmented to obtain a Chinese text and an English text.
Each word in the English text is segmented into a letter sequence, and the Chinese text is segmented into a character sequence.
The mixed sequence of characters and letters is input into a Chinese-English mixed prosody model, which outputs a prosody result.
The Chinese-English mixed prosody model comprises a Chinese-English mixed language model and multiple fully connected layers serving as the prosody network layer.
First, the Chinese-English mixed language model is trained without supervision on a large amount of pure Chinese, pure English, and Chinese-English mixed data. After the Chinese-English mixed language model is obtained, multiple fully connected layers are stacked on top of it. The whole model is then trained with supervision on Chinese-English mixed prosody annotation data to obtain the complete Chinese-English mixed prosody model.
When the Chinese-English mixed language model is trained, the input Chinese-English mixed text is first segmented into characters and words, and each word is further converted into a letter sequence. On this basis, Chinese characters can be indexed directly to character vectors through a dictionary. The letter sequence of a word is indexed to a letter vector sequence through a letter dictionary, and the word vector of the word is learned from the letter vector sequence through a fully connected layer and an attention mechanism.
The entire Chinese-English sequence is thereby converted into a sequence of character vectors and word vectors. The Chinese-English mixed language model is obtained by passing this vector sequence through a multi-layer conversion network layer.
According to the technical solution of this embodiment, prosody prediction for Chinese-English mixed text can be realized. Introducing the Chinese-English mixed language model effectively improves accuracy and recall for prosody annotation data of the same scale. That is, starting from the pre-trained Chinese-English mixed language model, a Chinese-English mixed prosody model with good precision and recall can be trained from a small amount of labeled data, which alleviates the scarcity of prosody annotation data.
In addition, the Chinese-English mixed language model models English words at the letter level, which effectively improves prediction accuracy for words that are not registered in the dictionary during Chinese-English mixed prosody prediction.
Fifth embodiment
Fig. 5 is a schematic structural diagram of a prosody prediction apparatus according to a fifth embodiment of the present application. Referring to fig. 5, the prosody prediction apparatus 500 provided in the present embodiment includes: a text splitting module 501, a word vector determination module 502, and a result determination module 503.
The text splitting module 501 is configured to segment a Chinese-English mixed text to be predicted to obtain a Chinese text and an English text;
the word vector determination module 502 is configured to determine character vectors of the characters in the Chinese text and word vectors of the words in the English text;
and the result determination module 503 is configured to determine a prosody prediction result of the Chinese-English mixed text according to the determined character vectors and word vectors.
According to the technical solution of this embodiment, the Chinese-English mixed text to be predicted is segmented to obtain a Chinese text and an English text, and the prosody prediction result of the Chinese-English mixed text is determined according to the character vectors of the Chinese characters and the word vectors of the words in the English text, thereby realizing prosody prediction for Chinese-English mixed text.
Further, the result determination module includes:
a vector sorting unit, configured to sort the character vectors of the characters and the word vectors of the words according to the positions of the characters and words in the Chinese-English mixed text to generate a text vector sequence;
and a result determination unit, configured to determine the prosody prediction result of the Chinese-English mixed text according to the text vector sequence.
Further, the word vector determination module includes:
a letter segmentation unit, configured to segment words in the English text into letter sequences;
a vector determination unit, configured to determine the letter vector of each letter in the letter sequence;
and a vector extraction unit, configured to extract a word vector representing word semantics according to the determined letter vectors.
Further, the vector extraction unit is specifically configured to:
sort the letter vectors according to the positions of the letters in the letter sequence to generate a letter vector sequence;
encode the letter vector sequence based on letter vector attention distribution probabilities to generate a semantic representation;
and decode the semantic representation to obtain the word vector.
Further, the prosody prediction result of the Chinese-English mixed text is determined through a Chinese-English mixed prosody recognition model;
the Chinese-English mixed prosody recognition model comprises a Chinese-English mixed language model and a prosody network layer; the Chinese-English mixed language model is obtained through unsupervised training.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device for the prosody prediction method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the electronic device includes: one or more processors 601, a memory 602, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, if desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the prosody prediction method provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the prosody prediction method provided by the present application.
The memory 602, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the prosody prediction method in the embodiment of the present application (for example, the text splitting module 501, the word vector determination module 502, and the result determination module 503 in the prosody prediction apparatus 500 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing, i.e., implements the prosody prediction method in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 602.
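The division of work among the three modules can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the present application: the regular expression, the function names, and the `toy_vector` placeholder embedding are all hypothetical, and a real system would use learned character and word vectors. The sketch only shows splitting a mixed text into Chinese characters and English words, vectorizing each group, and restoring the original order by position.

```python
import re
from typing import List, Tuple

# Assumed token pattern: one CJK ideograph, or a run of ASCII letters.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+")


def split_mixed_text(text: str) -> List[Tuple[int, str, str]]:
    """Text splitting step: return (position, kind, token), kind is 'zh' or 'en'."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        kind = "en" if token.isascii() else "zh"
        tokens.append((match.start(), kind, token))
    return tokens


def toy_vector(token: str, dim: int = 4) -> List[float]:
    """Placeholder embedding (hypothetical): deterministic values from code points."""
    total = sum(ord(ch) for ch in token)
    return [((total * (i + 1)) % 97) / 97.0 for i in range(dim)]


def build_text_vector_sequence(text: str, dim: int = 4) -> List[Tuple[int, str, List[float]]]:
    """Vectorize Chinese characters and English words separately, then order by position."""
    tokens = split_mixed_text(text)
    chinese = [(pos, tok) for pos, kind, tok in tokens if kind == "zh"]
    english = [(pos, tok) for pos, kind, tok in tokens if kind == "en"]
    # Character vectors and word vectors are computed per language group.
    vectors = [(pos, tok, toy_vector(tok, dim)) for pos, tok in chinese]
    vectors += [(pos, tok, toy_vector(tok, dim)) for pos, tok in english]
    # Sorting by original position yields the text vector sequence.
    vectors.sort(key=lambda item: item[0])
    return vectors
```

Calling `build_text_vector_sequence("我喜欢Python编程")` yields one entry per Chinese character and one for the word `Python`, in the order in which they appear in the input.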
The memory 602 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function; the data storage area may store data created through use of the electronic device for prosody prediction, and the like. Further, the memory 602 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 602 may optionally include memory located remotely from the processor 601, and such remote memory may be connected to the electronic device for prosody prediction over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the prosody prediction method may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for prosody prediction. Examples of the input device include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, and a joystick. The output device 604 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that the flows shown above may be used in various forms, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above-described embodiments are not intended to limit the scope of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (10)
1. A prosody prediction method, comprising:
segmenting a Chinese and English mixed text to be predicted to obtain a Chinese text and an English text;
determining character vectors of characters in the Chinese text and word vectors of words in the English text;
according to the positions of characters and words in the Chinese-English mixed text, sequencing the character vectors of the characters and the word vectors of the words to generate a text vector sequence;
determining a prosody prediction result of the Chinese-English mixed text according to the text vector sequence;
wherein the prosody prediction result comprises a pause position, a pause type and a probability value corresponding to the pause type.
2. The method of claim 1, wherein determining a word vector for a word in an English text comprises:
dividing words in the English text into letter sequences;
determining a letter vector of a letter in the letter sequence;
and extracting a word vector representing word semantics according to the determined letter vector.
3. The method of claim 2, wherein extracting a word vector representing word semantics from the determined letter vector comprises:
sorting the letter vectors according to the arrangement positions of letters in the letter sequence to generate a letter vector sequence;
coding the letter vector sequence based on the letter vector attention distribution probability to generate semantic representation;
and decoding the semantic representation to obtain the word vector.
4. The method of claim 1, wherein the prosody prediction result of the Chinese-English mixed text is determined by a Chinese-English mixed prosody recognition model;
the Chinese-English mixed prosody recognition model comprises a Chinese-English mixed language model and a prosody network layer; the Chinese-English mixed language model is obtained based on unsupervised learning training.
5. A prosody prediction device, comprising:
the text splitting module is used for splitting the Chinese and English mixed text to be predicted to obtain a Chinese text and an English text;
the word vector determining module is used for determining character vectors of characters in the Chinese text and word vectors of words in the English text;
the result determining module is used for sequencing the character vectors of the characters and the word vectors of the words according to the positions of the characters and the words in the Chinese-English mixed text to generate a text vector sequence; determining a prosody prediction result of the Chinese-English mixed text according to the text vector sequence; wherein the prosody prediction result comprises a pause position, a pause type and a probability value corresponding to the pause type.
6. The apparatus of claim 5, wherein the word vector determination module comprises:
the letter segmentation unit is used for segmenting words in the English text into letter sequences;
a vector determination unit for determining a letter vector of a letter in the letter sequence;
and the vector extraction unit is used for extracting a word vector representing word semantics according to the determined letter vector.
7. The apparatus according to claim 6, wherein the vector extraction unit is specifically configured to:
sorting the letter vectors according to the arrangement positions of letters in the letter sequence to generate a letter vector sequence;
coding the letter vector sequence based on the letter vector attention distribution probability to generate semantic representation;
and decoding the semantic representation to obtain the word vector.
8. The apparatus of claim 5, wherein the prosody prediction result of the Chinese-English mixed text is determined by a Chinese-English mixed prosody recognition model;
the Chinese-English mixed prosody recognition model comprises a Chinese-English mixed language model and a prosody network layer; the Chinese-English mixed language model is obtained based on unsupervised learning training.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-4.
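The letter-level steps recited in claims 2 and 3 can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the sinusoidal `letter_vector` placeholder, the fixed query vector used to score letters, and the identity matrix used as the decoder are hypothetical stand-ins for parameters that a trained model would learn. The sketch shows letter vectors kept in letter order, weighted by an attention distribution to form a semantic representation, and then decoded into a word vector.

```python
import math
from typing import List, Tuple

DIM = 4  # assumed vector size


def letter_vector(letter: str) -> List[float]:
    """Hypothetical letter embedding: a fixed pseudo-vector per letter."""
    code = ord(letter.lower()) - ord("a") + 1
    return [math.sin(code * (i + 1)) for i in range(DIM)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax, giving the attention distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode_with_attention(vectors: List[List[float]], query: List[float]) -> Tuple[List[float], List[float]]:
    """Weight each letter vector by its attention probability and sum them."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in vectors]
    weights = softmax(scores)
    semantic = [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(DIM)]
    return semantic, weights


def decode(semantic: List[float], projection: List[List[float]]) -> List[float]:
    """Decode the semantic representation with a linear projection."""
    return [sum(row[i] * semantic[i] for i in range(DIM)) for row in projection]


def word_vector(word: str) -> Tuple[List[float], List[float]]:
    """Letter sequence -> letter vector sequence -> attention encoding -> word vector."""
    vectors = [letter_vector(ch) for ch in word]  # kept in letter order
    query = [1.0] * DIM  # assumed fixed query; a real model would learn it
    semantic, weights = encode_with_attention(vectors, query)
    identity = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]
    return decode(semantic, identity), weights
```

Note that this simplified weighted sum does not itself depend on letter position; a sequence encoder that does would be needed to distinguish words that are anagrams of each other.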
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911072965.0A CN110797005B (en) | 2019-11-05 | 2019-11-05 | Prosody prediction method, apparatus, device, and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110797005A (en) | 2020-02-14 |
CN110797005B (en) | 2022-06-10 |
Family
ID=69442782
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911072965.0A Active CN110797005B (en) | 2019-11-05 | 2019-11-05 | Prosody prediction method, apparatus, device, and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110797005B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111724765B (en) * | 2020-06-30 | 2023-07-25 | 度小满科技(北京)有限公司 | Text-to-speech method and device and computer equipment |
CN112216267A (en) * | 2020-09-15 | 2021-01-12 | 北京捷通华声科技股份有限公司 | Rhythm prediction method, device, equipment and storage medium |
CN112131878B (en) * | 2020-09-29 | 2022-05-31 | 腾讯科技(深圳)有限公司 | Text processing method and device and computer equipment |
CN112289305A (en) * | 2020-11-23 | 2021-01-29 | 北京有竹居网络技术有限公司 | Prosody prediction method, device, equipment and storage medium |
CN112397050B (en) * | 2020-11-25 | 2023-07-07 | 北京百度网讯科技有限公司 | Prosody prediction method, training device, electronic equipment and medium |
CN113327579A (en) * | 2021-08-03 | 2021-08-31 | 北京世纪好未来教育科技有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN113836305B (en) * | 2021-09-29 | 2024-03-22 | 有米科技股份有限公司 | Text-based industry category identification method and device |
CN116665643B (en) * | 2022-11-30 | 2024-03-26 | 荣耀终端有限公司 | Rhythm marking method and device and terminal equipment |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1731510A (en) * | 2004-08-05 | 2006-02-08 | 摩托罗拉公司 | Text-speech conversion for amalgamated language |
CN105118499A (en) * | 2015-07-06 | 2015-12-02 | 百度在线网络技术(北京)有限公司 | Rhythmic pause prediction method and apparatus |
CN105989833A (en) * | 2015-02-28 | 2016-10-05 | 讯飞智元信息科技有限公司 | Multilingual mixed-language text character-pronunciation conversion method and system |
CN106601228A (en) * | 2016-12-09 | 2017-04-26 | 百度在线网络技术(北京)有限公司 | Sample marking method and device based on artificial intelligence prosody prediction |
CN107039034A (en) * | 2016-02-04 | 2017-08-11 | 科大讯飞股份有限公司 | A kind of prosody prediction method and system |
CN108305612A (en) * | 2017-11-21 | 2018-07-20 | 腾讯科技(深圳)有限公司 | Text-processing, model training method, device, storage medium and computer equipment |
CN109697973A (en) * | 2019-01-22 | 2019-04-30 | 清华大学深圳研究生院 | A kind of method, the method and device of model training of prosody hierarchy mark |
CN110298035A (en) * | 2019-06-04 | 2019-10-01 | 平安科技(深圳)有限公司 | Word vector based on artificial intelligence defines method, apparatus, equipment and storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101000764B (en) * | 2006-12-18 | 2011-05-18 | 黑龙江大学 | Speech synthetic text processing method based on rhythm structure |
CN110176225B (en) * | 2019-05-30 | 2021-08-13 | 科大讯飞股份有限公司 | Method and device for evaluating rhythm prediction effect |
Non-Patent Citations (1)
Title |
---|
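An end-to-end Chinese speech synthesis scheme based on Tacotron2; Wang Guoliang et al.; Journal of Northeast Normal University (Natural Science Edition); 2019-07-31; full text * |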
Also Published As
Publication number | Publication date |
---|---|
CN110797005A (en) | 2020-02-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110797005B (en) | Prosody prediction method, apparatus, device, and medium | |
CN111967268A (en) | Method and device for extracting events in text, electronic equipment and storage medium | |
CN111274764B (en) | Language generation method and device, computer equipment and storage medium | |
CN111078865B (en) | Text title generation method and device | |
CN110619867B (en) | Training method and device of speech synthesis model, electronic equipment and storage medium | |
CN111143561B (en) | Intention recognition model training method and device and electronic equipment | |
CN110717327A (en) | Title generation method and device, electronic equipment and storage medium | |
CN112633017B (en) | Translation model training method, translation processing method, translation model training device, translation processing equipment and storage medium | |
CN111241832A (en) | Core entity labeling method and device and electronic equipment | |
CN112365880A (en) | Speech synthesis method, speech synthesis device, electronic equipment and storage medium | |
CN110807331B (en) | Polyphone pronunciation prediction method and device and electronic equipment | |
CN112489637A (en) | Speech recognition method and device | |
CN112270198B (en) | Role determination method and device, electronic equipment and storage medium | |
CN111079945B (en) | End-to-end model training method and device | |
CN112507735A (en) | Training method and device of machine translation model and electronic equipment | |
JP2022151649A (en) | Training method, device, equipment, and storage method for speech recognition model | |
CN112153206B (en) | Contact person matching method and device, electronic equipment and storage medium | |
CN111950292A (en) | Training method of text error correction model, and text error correction processing method and device | |
CN110782871B (en) | Rhythm pause prediction method and device and electronic equipment | |
CN111858883A (en) | Method and device for generating triple sample, electronic equipment and storage medium | |
CN111241810A (en) | Punctuation prediction method and device | |
CN110767212B (en) | Voice processing method and device and electronic equipment | |
CN111666759A (en) | Method and device for extracting key information of text, electronic equipment and storage medium | |
CN112560499A (en) | Pre-training method and device of semantic representation model, electronic equipment and storage medium | |
CN111738015A (en) | Method and device for analyzing emotion polarity of article, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||