CN113823259A - Method and device for converting text data into phoneme sequence


Info

Publication number: CN113823259A
Application number: CN202110832833.4A
Authority: CN (China)
Prior art keywords: sentence, features, polyphonic, grammatical, speech
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 吴志勇, 宋长河, 周逸轩, 卞衍尧
Current Assignee: Tencent Technology Shenzhen Co Ltd; Shenzhen International Graduate School of Tsinghua University
Original Assignee: Tencent Technology Shenzhen Co Ltd; Shenzhen International Graduate School of Tsinghua University
Application filed by Tencent Technology Shenzhen Co Ltd and Shenzhen International Graduate School of Tsinghua University
Priority to CN202110832833.4A
Publication of CN113823259A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/06 Elementary speech units used in speech synthesisers; Concatenation rules
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 Prosody rules derived from text; Stress or intonation
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

A method, apparatus, device, and computer-readable storage medium for converting text data into a sequence of phonemes are disclosed. The method of converting text data into a sequence of phonemes comprises: extracting a sentence semantic feature corresponding to a sentence in the text data and a character semantic feature corresponding to one or more continuous characters in the sentence, determining a grammatical feature corresponding to the sentence based on the sentence semantic feature, determining a polyphone feature based on the character semantic feature and the grammatical feature corresponding to the sentence, the polyphone feature indicating polyphonic pronunciation information of the characters, and determining a phoneme sequence corresponding to the sentence based on the grammatical feature and the polyphone feature. The method uses a neural network to extract the grammatical features and polyphone features in the text data, fuses these features in a cascaded manner, and optionally introduces tone sandhi (tonal modification) information from the text data, so that the synthesized speech is more natural.

Description

Method and device for converting text data into phoneme sequence
Technical Field
The present disclosure relates to the field of artificial intelligence services, and more particularly, to a method, apparatus, device, and computer-readable storage medium for converting text data into a phoneme sequence.
Background
Text-To-Speech (TTS) technology has been proposed to convert text data into speech. TTS technology has been widely applied to products such as voice assistants, intelligent navigation, electronic books, and the like. TTS technology draws on both linguistics and psychology, and intelligently converts characters into natural speech streams through the design of a neural network. However, current TTS technology is still not friendly enough to ideographic languages (such as spoken Chinese).
Currently, before generating speech, it is necessary to convert an input character sequence into a corresponding sequence of pronunciation phonemes. This conversion process is also referred to as front-end processing in TTS technology. Ideographic languages typically exhibit tone changes (tone sandhi), such as the Chinese second-tone and third-tone sandhi and the neutral (soft) tone. These tone changes can make the converted pronunciation phoneme sequence inaccurate. Currently, the conversion of ideographic text into phoneme sequences is almost entirely based on conversion rules preset by linguists. For example, in the case of Chinese, a linguist typically summarizes a set of pronunciation annotation rules for series of Chinese characters, which are then written into a computer-understandable form. However, establishing preset conversion rules is labor intensive and cannot easily cover all situations. In addition, as such rules become more complex, the conversion of the same Chinese character may be matched by multiple rules, resulting in rule conflicts. As data grows, more and more researchers are trying to use statistics-based methods for front-end processing. However, these methods are highly dependent on feature engineering and the experience of the modelers.
Researchers are also considering the use of neural networks to solve the above problems. However, current neural network schemes still suffer from problems such as difficult pronunciation annotation and inaccurate prediction, which lead to low speech synthesis quality. Therefore, further improvements to the front-end processing scheme in existing TTS technology are needed to synthesize speech that is friendlier to ideographic languages.
Disclosure of Invention
Embodiments of the present disclosure provide a method and apparatus for converting text data into a phoneme sequence, a method and apparatus for simplifying a complex text processing model into a lightweight text processing model, and a computer-readable storage medium.
An embodiment of the present disclosure provides a method of converting text data into a sequence of phonemes, including: extracting a sentence semantic feature corresponding to the sentence and a character semantic feature corresponding to one or more continuous characters in the sentence based on the sentence in the text data, determining a grammatical feature corresponding to the sentence based on the sentence semantic feature corresponding to the sentence, determining a polyphonic feature based on the character semantic feature and the grammatical feature corresponding to the sentence, the polyphonic feature indicating polyphonic pronunciation information of the character, and determining a phoneme sequence corresponding to the sentence based on the grammatical feature and the polyphonic feature.
An embodiment of the present disclosure provides an apparatus for converting text data into a phoneme sequence, including: an extraction unit configured to extract a sentence semantic feature corresponding to a sentence and a character semantic feature corresponding to one or more continuous characters in the sentence based on the sentence in the text data, a first determination unit configured to determine a grammatical feature corresponding to the sentence based on the sentence semantic feature, a second determination unit configured to determine a polyphonic feature based on the character semantic feature and a grammatical feature corresponding to the sentence, the polyphonic feature indicating polyphonic pronunciation information of a character, and a third determination unit configured to determine a phoneme sequence corresponding to the sentence based on the grammatical feature and the polyphonic feature.
An embodiment of the present disclosure provides an apparatus for converting text data into a phoneme sequence, including: a processor; and a memory, wherein the memory has stored therein a computer-executable program that, when executed by the processor, performs the method described above.
An embodiment of the present disclosure provides an apparatus for simplifying a complex text processing model into a lightweight text processing model, including: a processor; a memory storing computer instructions that, when executed by the processor, implement the above-described method.
Embodiments of the present disclosure provide a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above-described method.
According to another aspect of the present disclosure, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the above aspects or various alternative implementations of the above aspects.
The embodiments of the present disclosure provide a method for converting text data into phoneme sequences, which extracts grammatical features and polyphone features in the text data by using a neural network, fuses these features in a cascaded manner, and optionally introduces tone sandhi information from the text data.
First, the embodiments of the present disclosure fuse a plurality of features in the text data in a cascaded form, thereby obtaining features that incorporate the mutual-influence information among the plurality of features.
Second, the embodiments of the present disclosure introduce polyphone features in the front-end processing to eliminate ambiguity in the speech synthesis process, thereby providing more correct dictionary pronunciations for the character sequence to be synthesized.
Third, the embodiments of the present disclosure introduce grammatical features to assist prosody control in the front-end processing, so that the prosody of the synthesized speech is more accurate.
Fourth, the embodiments of the present disclosure also introduce tone sandhi information in the front-end processing, so that the synthesized speech is more natural.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly introduced below. The drawings in the following description are merely exemplary embodiments of the disclosure.
Fig. 1A is an example schematic diagram illustrating an application scenario according to an embodiment of the present disclosure.
Fig. 1B is an exemplary diagram illustrating a model for converting text data into a phoneme sequence.
Fig. 2A is a flowchart illustrating a method of converting text data into a phoneme sequence according to an embodiment of the present disclosure.
Fig. 2B is a schematic diagram illustrating a method 200 of converting text data into a sequence of phonemes, in accordance with an embodiment of the present disclosure.
Fig. 3 is yet another schematic diagram illustrating a method 200 of converting text data into a sequence of phonemes according to an embodiment of the present disclosure.
FIG. 4A is an example schematic diagram of a set of component span scores according to an embodiment of the disclosure.
Fig. 4B is a schematic diagram of a syntax tree according to an embodiment of the present disclosure.
FIG. 5 is yet another schematic diagram of a polyphonic analysis module in accordance with an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of an apparatus for converting text data into a phoneme sequence according to an embodiment of the present disclosure.
Fig. 7 is a block diagram illustrating an apparatus for converting text data into a phoneme sequence according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
In the present specification and the drawings, steps and elements having substantially the same or similar characteristics are denoted by the same or similar reference numerals, and repeated description of the steps and elements will be omitted. Meanwhile, in the description of the present disclosure, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance or order.
For the purpose of describing the present disclosure, concepts related to the present disclosure are introduced below.
The present disclosure may utilize an acoustic model to implement the method of converting text data into a sequence of phonemes. The first encoder, the second encoder, the component analysis module, the pinyin prediction layer, the tone sandhi analysis module, the decoder, the speech generation module, and the like mentioned below are all constituent modules of the acoustic model.
The acoustic model of the present disclosure may be Artificial Intelligence (AI) based. Artificial intelligence is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. For example, with the acoustic model of the present disclosure, text in multiple different languages can be handled in a manner similar to how humans read and understand those languages. By studying the design principles and implementation methods of various intelligent machines, artificial intelligence gives the acoustic model of the present disclosure the ability to understand multiple different languages.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level techniques. Artificial intelligence software technology mainly includes computer vision technology, natural language processing, machine learning/deep learning, and the like.
Optionally, the acoustic model in the present disclosure employs Natural Language Processing (NLP) technology. Natural language processing technology is an important direction in the fields of computer science and artificial intelligence; it studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Thus, based on natural language processing techniques, the acoustic model of the present disclosure can analyze input text data, extract features from the text data, and then generate audio data in a manner similar to a human reading the text aloud.
Optionally, the natural language processing techniques employed by embodiments of the present disclosure may also be Machine Learning (ML) and deep Learning based. Machine learning is a multi-field cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The natural language processing technology utilizes machine learning to study how a computer simulates or realizes the behavior of human learning language, acquires new knowledge or skills by analyzing the existing and classified text data, and reorganizes the existing knowledge structure to continuously improve the performance of the knowledge structure. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and the like.
Alternatively, the acoustic models that are useful in the embodiments of the present disclosure hereinafter may all be artificial intelligence models, in particular artificial intelligence based neural network models. Typically, artificial intelligence based neural network models are implemented as acyclic graphs, with neurons arranged in different layers. Typically, the neural network model comprises an input layer and an output layer, the input layer and the output layer being separated by at least one hidden layer. The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are all connected to nodes in adjacent layers via edges, and no edge exists between nodes in each layer. Data received at a node of an input layer of a neural network is propagated to a node of an output layer via any one of a hidden layer, an active layer, a pooling layer, a convolutional layer, and the like. The input and output of the neural network model may take various forms, which the present disclosure does not limit.
The scheme provided by the embodiment of the disclosure relates to technologies such as artificial intelligence, natural language processing and machine learning, and is specifically described by the following embodiment.
The acoustic model of the embodiments of the present disclosure may be specifically integrated in an electronic device, which may be a terminal or a server or the like. For example, the acoustic model may be integrated in the terminal. The terminal may be, but is not limited to, a mobile phone, a tablet Computer, a notebook Computer, a desktop Computer, a Personal Computer (PC), a smart speaker, a smart watch, or the like. As another example, the acoustic model may be integrated on a server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data and an artificial intelligence platform. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the disclosure is not limited thereto.
It can be understood that the device for reasoning by applying the acoustic model of the embodiment of the present disclosure may be a terminal, a server, or a system composed of a terminal and a server.
It is understood that the method of the acoustic model of the embodiment of the present disclosure for converting text data into acoustic features may be executed on a terminal, a server, or both.
The acoustic model provided by the embodiments of the disclosure may also relate to artificial intelligence cloud services in the field of cloud technology. Cloud technology is a hosting technology that unifies a series of resources such as hardware, software and networks in a wide area network or a local area network to realize computation, storage, processing and sharing of data. Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied on the basis of the cloud computing business model; these resources can form a pool and be used on demand, which is flexible and convenient. Cloud computing technology will become an important support. Background services of technical network systems, such as video websites, picture websites and other web portals, require a large amount of computing and storage resources. With the rapid development and application of the internet industry, each item may have its own identification mark and need to be transmitted to a background system for logical processing; data at different levels are processed separately, and all kinds of industrial data require strong back-end system support, which can only be realized through cloud computing.
Among them, the artificial intelligence cloud service is also generally called AIaaS (AI as a Service). It is a service mode of an artificial intelligence platform; specifically, the AIaaS platform splits several types of common AI services and provides independent or packaged services in the cloud. This service model is similar to an AI-themed application store: all developers can access one or more artificial intelligence services provided by the platform by means of an Application Programming Interface (API), and some qualified developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate and maintain their own dedicated cloud artificial intelligence services.
Fig. 1A is an example schematic diagram illustrating a scenario 100 in which an acoustic model infers in accordance with an embodiment of the present disclosure. Fig. 1B is an exemplary diagram illustrating a model for converting text data into a phoneme sequence.
Currently, there are many read-aloud (text-to-speech) applications. A user may install such an application on their user terminal and indicate to the application that text data needs to be converted into audio data. The user terminal may then transmit a text data conversion request to the server of the application through the network, receive the converted audio data corresponding to the text data, and then play the audio data.
After receiving the text data to be converted, the server converts the text data by using the acoustic model to obtain audio data, and then feeds back the audio data (for example, the audio data corresponding to the text data in fig. 1A) to the user.
The user may score the audio data. For example, if the user considers that the audio data and the text data have good correspondence, can still pronounce accurately in the case of polyphones, and is close to the effect of reading by a real person, the user may give a higher score to the audio data, and the server may take the text data-audio data pair as a positive sample for training the acoustic model in real time. If the user gives a lower score to the audio data, the server may treat the text data-audio data pair as a negative example for training the acoustic model in real-time. A collection of a plurality of such text data-audio data pairs is also referred to as a text-to-speech data set.
Of course, the server may also obtain samples for training the acoustic model in other ways. For example, the server may crawl existing human-read audio and its corresponding text from the current internet environment, and then use such read-aloud text to train the acoustic model. For example, referring to FIG. 1A, the server may retrieve text from a database and then use it for training the acoustic model.
Current acoustic models for converting text data into phoneme sequences can be complex, difficult to annotate phonetically, or prone to inaccurate prediction results. Several acoustic models that may be used are briefly described below with reference to FIG. 1B.
Shan, Dai, Zhang, Zou et al. propose a grapheme-to-phoneme (G2P) conversion network, or an assisted G2P network, as shown in FIG. 1B, to alleviate the problem of inaccurate synthesized speech caused by complex pronunciation labeling rules. In this G2P network, a polyphone disambiguation model is added to make the conversion between polyphonic characters and phonemes more accurate. Specifically, the polyphone disambiguation model takes contextual information as input and is trained independently on separate data sets, in the expectation of resolving the ambiguity of polyphonic characters. Although this method handles the conversion between polyphonic characters and phonemes with a polyphone disambiguation model, the scheme needs to extract context information through an independent module; not only is the process complicated, but the disambiguation effect is also poor.
More recently, Dai et al. have also attempted to assist in resolving the ambiguity of polyphonic characters by extracting the embeddings of a pre-trained BERT model. Pan et al. have attempted to correct pronunciation errors using a rule-based tone sandhi (sound variation) system. However, in these methods, each sub-module is trained independently on heterogeneous data sets, so the knowledge learned by each sub-module remains isolated, resulting in poor performance and robustness.
The present disclosure provides a method of converting text data into a sequence of phonemes, comprising: extracting a sentence semantic feature corresponding to the sentence and a character semantic feature corresponding to one or more continuous characters in the sentence based on the sentence in the text data, determining a grammatical feature corresponding to the sentence based on the sentence semantic feature corresponding to the sentence, determining a polyphonic feature based on the character semantic feature and the grammatical feature corresponding to the sentence, the polyphonic feature indicating polyphonic pronunciation information of the character, and determining a phoneme sequence corresponding to the sentence based on the grammatical feature and the polyphonic feature.
Thus, the acoustic model of the present disclosure is composed of a plurality of cascaded sub-modules. For example, these sub-modules may respectively perform constituency-based analysis for word segmentation and part-of-speech tagging, syntax-tree-based linguistic feature learning, attention-based polyphone disambiguation, tone sandhi prediction, and speech generation. By cascading the sub-modules, the information learned by each sub-module is fused and shared, thereby improving performance and robustness. In addition, the acoustic model of the present disclosure can also improve the naturalness of the synthesized speech based on the component analysis. For example, the present disclosure extracts highly prosody-related grammatical features from the component analysis tree and uses these grammatical features as additional input for TTS, thereby avoiding training a separate prosodic structure prediction module and improving performance and robustness.
Embodiments according to the present disclosure are described in more detail below with reference to fig. 2A-5.
Fig. 2A is a flow chart illustrating a method 200 of converting text data into a sequence of phonemes according to an embodiment of the present disclosure. Fig. 2B is a schematic diagram illustrating a method 200 of converting text data into a sequence of phonemes, in accordance with an embodiment of the present disclosure.
The method 200 of converting text data into a phoneme sequence according to an embodiment of the present disclosure may be applied to any electronic device. It is understood that the electronic device may be a different kind of hardware device, such as a Personal Digital Assistant (PDA), an audio/video device, a mobile phone, an MP3 player, a personal computer, a laptop computer, a server, etc. For example, the electronic device may be the server and the user terminal in fig. 1A, and so on. In the following, the present disclosure is described by taking a server as an example, and those skilled in the art should understand that the present disclosure is not limited thereto.
For example, the method 200 according to an embodiment of the present disclosure includes the following steps S201 to S204. First, in step S201, based on a sentence in the text data, a sentence semantic feature corresponding to the sentence and a character semantic feature corresponding to one or more continuous characters in the sentence are extracted. Next, in step S202, a grammatical feature corresponding to the sentence is determined based on the sentence semantic feature corresponding to the sentence. Next, in step S203, a polyphone feature indicating polyphonic pronunciation information of the characters is determined based on the character semantic features and the grammatical features corresponding to the sentence. In step S204, a phoneme sequence corresponding to the sentence is determined based on the grammatical features and the polyphone features.
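Purely as an illustration of the data flow through steps S201 to S204, the following Python sketch strings the four steps together; all module and function names are hypothetical placeholders rather than the patent's actual implementation.

```python
# Hypothetical sketch of the S201-S204 data flow; every name below is a
# placeholder standing in for one of the modules described in this disclosure.

def text_to_phonemes(sentence,
                     semantic_module,    # step S201: sentence/character semantics
                     syntax_module,      # step S202: grammatical features
                     polyphone_module,   # step S203: polyphone features
                     speech_module):     # step S204: phoneme sequence
    sentence_feat, char_feats = semantic_module(sentence)        # S201
    grammar_feat = syntax_module(sentence_feat)                  # S202
    polyphone_feat = polyphone_module(char_feats, grammar_feat)  # S203
    return speech_module(grammar_feat, polyphone_feat)           # S204
```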
For example, the text data described in this disclosure may be any element that constitutes the text to be read aloud in Fig. 1A, such as a word, a sentence, a phrase, a paragraph, a chapter, and the like. The present disclosure does not set any limit on the length or language of the text data; for example, the text data may include text in English, Chinese, Russian, Japanese, Korean, and the like (such as the Chinese example sentence used below). The following description is given by way of example only, and it will be understood by those skilled in the art that the present disclosure is not limited thereto.
For example, referring to Fig. 2B, step S201 may be performed using a sentence semantic analysis module. Alternatively, the sentence semantic analysis module may be a neural network model. The neural network model is, for example, implemented as an acyclic graph in which the neurons are arranged in different layers. The neural network model includes an input layer and an output layer, separated by at least one hidden layer. The hidden layer transforms the input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are all connected to nodes in adjacent layers via edges, and no edges exist between nodes within the same layer. Data received at a node of the input layer of the neural network is propagated to a node of the output layer via any one of a plurality of hidden layers, activation layers, pooling layers, convolutional layers, and the like. The input and output of the neural network model may take various forms, which the present disclosure does not limit. Optionally, the sentence semantic analysis module may also be a BERT (Bidirectional Encoder Representations from Transformers) network designed for Chinese characters. The BERT network may be trained using large-scale unlabeled corpora to obtain semantic information in the text data for subsequent processing.
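As a rough sketch of how such a Chinese BERT model could supply both sentence-level and character-level semantic features (the pretrained checkpoint name and the pooling choice are assumptions for illustration, not specified by the patent):

```python
import torch
from transformers import BertModel, BertTokenizer

# "bert-base-chinese" is an assumed checkpoint; the patent only states that a
# BERT network designed for Chinese characters is used.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

sentence = "小朋友们好可爱啊"
inputs = tokenizer(sentence, return_tensors="pt")   # adds the [CLS]/[SEP] markers

with torch.no_grad():
    outputs = bert(**inputs)

# Character-level semantic features: one hidden vector per character
# (plus the special start/end markers c_0 and c_{L+1}).
char_semantic_features = outputs.last_hidden_state            # shape (1, L + 2, 768)

# Sentence-level semantic feature: here simply the [CLS] vector; mean pooling
# over the characters would be another reasonable choice.
sentence_semantic_feature = outputs.last_hidden_state[:, 0]   # shape (1, 768)
```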
For example, the sentence semantic feature extracted by the sentence semantic analysis module may correspond to the semantics of the whole sentence, and the character semantic features may correspond to the semantics of combinations of all or part of the continuous characters in the sentence. For example, if a sentence in the text data is "小朋友们好可爱啊" ("the kids are so cute", the running example used below), the sentence semantic feature corresponds to the semantics of that sentence, and the character semantic features may correspond to the semantics of individual characters or combinations of continuous characters such as "小朋友们", "好", "可爱" and "啊", respectively. In the embodiments of the present disclosure, the sentence semantic feature and the character semantic features may also be identical features to facilitate subsequent operations. The character semantic features may also be composed of a subset of the elements of the sentence semantic feature, which is not limited by the present disclosure.
Continuing next with fig. 2B, step S202 can be performed using a parsing module and a grammar feature learning module. Optionally, one or more neural network models may be included in the parsing module and the grammar feature learning module. Optionally, the neural network model in the parsing module is trained with the syntax tree data set as a sample. The structure and training mode of the neural network model in the syntax analysis module and the grammatical feature learning module will be further described with reference to fig. 3 to 5, and will not be repeated here.
For example, the parsing module may determine the grammatical features corresponding to the sentence based on the sentence semantic features of the sentence. Optionally, the above-mentioned grammatical features fuse the syntactic structure information of the sentence, the part-of-speech information corresponding to each segmented word in the sentence, word-segmentation boundary information, word-segmentation position information, and the like. For example, after training, the neurons in the neural network model of the parsing module may predict the part-of-speech information, word-segmentation boundary information and word-segmentation position information corresponding to a sentence based on its sentence semantic features. Furthermore, after training, the neurons in the neural network model of the parsing module may predict phrase-level information (e.g., phrase combination information, phrase constituent information, phrase boundary information, phrase position information, etc.) and sentence-level information (e.g., syntactic structure information of the sentence, sentence boundary information, sentence attention information, etc.) based on the sentence semantic features of a given sentence, and the disclosure is not limited thereto.
The neural network may then use this information to determine the corresponding grammatical encoding features of the sentence. Then, the grammar analysis module determines the grammar characteristics corresponding to the sentence by utilizing the grammar coding characteristics corresponding to the sentence.
For example, the neural network of the grammar feature learning module may further determine grammar features for polyphonic pronunciations and grammar features for prosody based on the grammar features determined by the grammar analysis module. Alternatively, since determining polyphonic information for a certain character often requires only information for a few characters before and after the character, and does not require information for the entire sentence, the length (span) of a grammatical feature for polyphonic characters may be smaller than a grammatical feature for prosody.
Next, step S203 may be performed using a polyphone analysis module. Optionally, the neural network in the polyphone analysis module may fuse the grammatical features and the character semantic features, so as to extract the polyphonic pronunciation information of the characters. For example, the polyphone analysis module may splice the grammatical features and the character semantic features into initial polyphone features, and then determine the polyphone features based on the initial polyphone features and the dictionary pronunciation information corresponding to each character in the character combination. Optionally, after the neural network model in the polyphone analysis module is trained, the dictionary pronunciation information may be fused into the neuron parameters of the neural network model. The dictionary pronunciation information relates to the polyphonic pronunciation information of a given character in the dictionary. For example, the neural network model in the polyphone analysis module may predict the pronunciation indicated for a given character in the dictionary based on the initial polyphone features (e.g., obtaining the pinyin information of a given polyphonic character). The structure and training mode of the neural network model in the polyphone analysis module, and how the dictionary pronunciation information is extracted through training, will be further described with reference to Figs. 3 to 5, and will not be repeated here.
For example, as described above, the neural network of the grammar feature learning module may determine grammar features for polyphonic pronunciations based on grammar features corresponding to the sentences. Thus, the polyphonic analysis module may also determine polyphonic characteristics based on the grammatical features for the polyphonic pronunciation and the character semantic features. Alternatively, the grammatical features for the polyphonic pronunciation may be determined by a grammatical feature learning module. That is, the polyphone analysis module may be cascaded with the grammar feature learning module to directly obtain grammar features for polyphone pronunciation. Then, the polyphone analysis module may splice the grammatical features for polyphone pronunciation and the character semantic features into initial polyphone features, and then determine the polyphone features based on the initial polyphone features and dictionary pronunciation information corresponding to each character in the character combination. Although FIG. 2B shows the polyphonic analysis module cascaded with the grammar feature learning module to obtain grammar information for polyphonic words, those skilled in the art will also appreciate that the polyphonic analysis module may also be cascaded with the grammar analysis module to directly obtain grammar features determined by the grammar analysis module.
In addition, the polyphone analysis module may be cascaded with the tone sandhi analysis module to input the polyphone features to the tone sandhi analysis module. Optionally, the tone sandhi information may be fused into the neurons of the neural network in the tone sandhi analysis module, so that the tone sandhi analysis module can fuse the polyphone features with the tone sandhi information by using its built-in neural network to determine polyphone features fused with the tone sandhi information.
As an example, the polyphone features, and the polyphone features fused with the tone sandhi information, may be a pinyin sequence for generating speech or a multidimensional numerical vector fused with the pinyin information of each character in the sentence, and the disclosure is not limited thereto.
Next, step S204 may be performed using a speech generation module. For example, as described above, the neural network of the grammar feature learning module may determine grammatical features for prosody based on the grammatical features corresponding to the sentence. The speech generation module may then determine the phoneme sequence corresponding to the sentence based on the prosodic grammatical features and the polyphone features. That is, the speech generation module may be cascaded with the grammar feature learning module to directly acquire the grammatical features for prosody. Although Fig. 2B shows the speech generation module cascaded with the grammar feature learning module to obtain the grammatical information for prosody, those skilled in the art will also understand that the speech generation module may instead be cascaded with the grammar analysis module to directly obtain the grammatical features determined by the grammar analysis module. The syntactic structure information of a sentence and its prosody information are highly similar, so the prosody-oriented grammatical features can assist prosody control of the speech; there is no need to design matching rules or train a separate prosodic structure prediction module for prosody control alone, which reduces the difficulty of prosody control and makes the synthesized speech more natural.
As an example, the speech generation module may also be directly cascaded with the tone sandhi analysis module to directly obtain the polyphone features fused with the tone sandhi information. Further, the speech generation module may determine the phoneme sequence corresponding to the sentence based on the polyphone features fused with the tone sandhi information and the grammatical features for prosody. Although Fig. 2B shows the speech generation module cascaded with the tone sandhi analysis module to obtain the polyphone features fused with the tone sandhi information, it will be understood by those skilled in the art that the speech generation module may also be cascaded with the polyphone analysis module to directly obtain the polyphone features determined by the polyphone analysis module.
For example, in the case where the polyphone features, or the polyphone features fused with the tone sandhi information, are pinyin sequences (or multidimensional numerical vectors fused with the pinyin information of each character in the sentence) used for generating speech, the speech generation module may further include a pinyin-to-phoneme conversion module to convert the polyphone features (or the polyphone features fused with the tone sandhi information) into an initial phoneme sequence. Then, other models in the speech generation module (which may be linear models or neural network models) determine the phoneme sequence corresponding to the sentence by combining the initial phoneme sequence and the grammatical features (or the grammatical features for prosody).
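A minimal sketch of the kind of pinyin-to-phoneme conversion such a module could perform, splitting each toned pinyin syllable into an initial and a toned final; the syllable format ("xiao3"), the initial inventory, and the neutral-tone digit "5" are all assumptions made for illustration:

```python
# Assumed convention: each syllable is pinyin plus a tone digit (5 = neutral tone),
# e.g. "xiao3". Initials are listed longest-first so "zh"/"ch"/"sh" match before
# "z"/"c"/"s"; treating "y"/"w" as initials is also an assumption.
_INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
             "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def pinyin_to_phonemes(pinyin_seq):
    """Convert toned pinyin syllables into an initial/toned-final phoneme sequence."""
    phonemes = []
    for syllable in pinyin_seq:
        body, tone = syllable[:-1], syllable[-1]
        initial = next((ini for ini in _INITIALS if body.startswith(ini)), "")
        if initial:
            phonemes.append(initial)
        phonemes.append(body[len(initial):] + tone)   # keep the tone on the final
    return phonemes

# Running example "小朋友们好可爱啊":
print(pinyin_to_phonemes(["xiao3", "peng2", "you3", "men5", "hao3", "ke3", "ai4", "a5"]))
# ['x', 'iao3', 'p', 'eng2', 'y', 'ou3', 'm', 'en5', 'h', 'ao3', 'k', 'e3', 'ai4', 'a5']
```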
In addition, the speech generation module may also include a neural network that converts the phoneme sequence into audio data and a vocoder. For example, the neural network that converts the phoneme sequence into audio data may be an attention-based autoregressive neural network model (e.g., Tacotron) or a duration-predictor-based non-autoregressive feed-forward neural network model (e.g., FastSpeech), and so on, and the disclosure is not limited thereto. The audio data described in this disclosure may be mel-spectrogram feature data; of course, the audio data may also be audio data in any format that can be decoded by the vocoder, and this disclosure is not limited thereto.
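For illustration only, the sketch below shows the general shape of a duration-predictor-based, FastSpeech-style phoneme-to-mel network; every dimension, layer count and the simple length-regulation loop are assumptions, and a separate vocoder (not shown) would turn the predicted mel frames into a waveform:

```python
import torch
import torch.nn as nn

class TinyFastSpeechLike(nn.Module):
    """Reduced, duration-based phoneme-to-mel sketch (FastSpeech-style).
    All sizes are illustrative; this is not the patent's actual model."""
    def __init__(self, n_phonemes=100, d_model=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.duration = nn.Linear(d_model, 1)   # predicted frames per phoneme
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, phoneme_ids):              # (1, T) long tensor, batch size 1
        h = self.encoder(self.embed(phoneme_ids))                 # (1, T, d_model)
        dur = self.duration(h).squeeze(-1).clamp(min=1).round().long()
        # length regulation: repeat each phoneme encoding by its predicted duration
        frames = torch.cat([h[0, t].repeat(int(dur[0, t]), 1)
                            for t in range(h.size(1))]).unsqueeze(0)
        return self.to_mel(self.decoder(frames))                  # (1, n_frames, n_mels)

# usage: mel = TinyFastSpeechLike()(torch.tensor([[11, 42, 7, 93]]))
```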
Thus, the embodiments of the present disclosure provide a method for converting text data into phoneme sequences, which extracts grammatical features and polyphone features in the text data using a neural network, fuses these features in a cascaded manner, and optionally introduces tone sandhi information from the text data. Compared with the previous methods, the embodiments of the present disclosure have the following four advantages.
First, the embodiments of the present disclosure fuse a plurality of features in the text data in a cascaded form, thereby obtaining features that incorporate the mutual-influence information among the plurality of features.
Second, the embodiments of the present disclosure introduce polyphone features in the front-end processing to eliminate ambiguity in the speech synthesis process, thereby providing more correct dictionary pronunciations for the character sequence to be synthesized.
Third, the embodiments of the present disclosure introduce grammatical features to assist prosody control in the front-end processing, so that the prosody of the synthesized speech is more accurate.
Fourth, the embodiments of the present disclosure also introduce tone sandhi information in the front-end processing, so that the synthesized speech is more natural.
The various modules described above are described in more detail below in conjunction with fig. 3-5.
Fig. 3 is yet another schematic diagram illustrating a method 200 of converting text data into a sequence of phonemes according to an embodiment of the present disclosure. FIG. 4A is an example schematic diagram of a set of component span scores according to an embodiment of the disclosure. Fig. 4B is a schematic diagram of a syntax tree according to an embodiment of the present disclosure. FIG. 5 is yet another schematic diagram of a polyphonic analysis module in accordance with an embodiment of the present disclosure.
Referring to Fig. 3, the sentence semantic analysis module extracts, based on the sentences in the text data, the sentence semantic features corresponding to the sentences and the character semantic features corresponding to one or more continuous characters in the sentences. The sentence semantic analysis module inputs the sentence semantic features and the character semantic features to the grammar analysis module and the polyphone analysis module, respectively. As one example, the sentence semantic analysis module is a BERT model for Chinese characters, whose output is a Chinese-character BERT embedding sequence. For example, the Chinese-character BERT embedding sequence may be [c_0, c_1, c_2, ..., c_k, ..., c_L, c_{L+1}], where L is the length (number of characters) of the input sentence, and c_0 and c_{L+1} are special start and end markers that assist the subsequent component analysis module and dynamic programming decoder calculations.
The syntax analysis module may include a first encoder, a component analysis module, and a dynamic programming decoder. The dynamic programming decoder is only used during training, while the first encoder and the component analysis module are used both during training and during inference. As one example, the syntactic encoding features corresponding to the sentence are determined by the first encoder based on the sentence semantic features, the grammatical features corresponding to the sentence are determined by the component analysis module based on the syntactic encoding features, and the first encoder and the component analysis module are each trained on a syntax tree data set.
For example, the training of the first encoder may comprise the following steps. First, based on a grammar sample sentence in the syntax tree data set, the syntactic encoding features corresponding to the sample sentence are determined by the first encoder. Then, using the component analysis module, the grammatical features corresponding to the grammar sample sentence are determined and the component span scores in the syntactic encoding features corresponding to the grammar sample sentence are extracted. Then, based on the component span scores, a predicted part-of-speech label, a predicted word-segmentation boundary label and a predicted word-segmentation position label corresponding to each segmented word in the grammar sample sentence are determined. Then, the value of a first loss function is calculated based on the predicted part-of-speech label, the predicted word-segmentation boundary label and the predicted word-segmentation position label corresponding to each segmented word in the grammar sample sentence, and the actual part-of-speech label, the actual word-segmentation boundary label and the actual word-segmentation position label corresponding to each segmented word in the grammar sample sentence. Then, based on the value of the first loss function, the parameters of the neurons in the first encoder and the component analysis module are adjusted so that the first loss function converges.
Optionally, the first encoder is a neural network model stacked from 8 identical converters (transformer blocks). Each converter comprises a cascade of a multi-head attention layer, a first regularization layer, a feed-forward layer and a second regularization layer, and the output of the first regularization layer is also input to the second regularization layer. The sentence semantic features may be input not only to the multi-head attention layer of the first converter but also to its first regularization layer, and the disclosure is not limited thereto.
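A minimal PyTorch sketch of one such converter block and the stack of eight, under assumed dimensions (the hidden sizes and head count are not given by the patent):

```python
import torch
import torch.nn as nn

class ConverterBlock(nn.Module):
    """One of the eight identical converter (transformer) blocks of the first
    encoder: multi-head attention -> first LayerNorm -> feed-forward -> second
    LayerNorm, with the first LayerNorm's output also fed into the second one
    through a residual connection. Dimensions and head count are assumptions."""
    def __init__(self, d_model=768, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (B, L + 2, d_model)
        a, _ = self.attn(x, x, x)
        h = self.norm1(x + a)                   # first regularization layer
        return self.norm2(h + self.ff(h))       # its output re-enters the second one

# The first encoder stacks eight identical blocks over the BERT embeddings
# [c_0, ..., c_{L+1}] to produce the syntactic encoding features [y_0, ..., y_{L+1}].
first_encoder = nn.Sequential(*[ConverterBlock() for _ in range(8)])
```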
As shown in Fig. 3, the first encoder takes the sentence semantic features as input, outputs the syntactic encoding features, and inputs the syntactic encoding features to the component analysis module. That is, the first encoder is configured to determine the syntactic encoding features corresponding to the sentence based on the sentence semantic features. Continuing the example above, assume that the first encoder takes the Chinese-character BERT embedding sequence [c_0, c_1, c_2, ..., c_k, ..., c_L, c_{L+1}] as input and outputs the syntactic encoding features [y_0, y_1, y_2, ..., y_k, ..., y_L, y_{L+1}]; the syntactic encoding features have the same length as the Chinese-character BERT embedding sequence described above.
Alternatively, the component analysis module may take the syntactic encoding features as input and the grammatical features as output. For example, the component analysis module may determine a set of component span scores for the syntactic encoding features. As one example, when training the first encoder using the syntax tree data set, the component analysis module may be configured to determine the grammatical features corresponding to a grammar sample sentence and extract the component span scores from the syntactic encoding features corresponding to the grammar sample sentence. Continuing with the above example, the component analysis module combines the elements of the syntactic encoding features [y_0, y_1, y_2, ..., y_k, ..., y_L, y_{L+1}] and predicts a set of component span scores s(i, j, ·) for them. For example, the set of component span scores s(i, j, ·) can be calculated with equations (1) and (2):

s(i, j, ·) = W_2 ReLU(LayerNorm(W_1 v + b_1)) + b_2,    (1)

v = [→y_j - →y_i ; ←y_i - ←y_j],    (2)

where 0 ≤ i ≤ j ≤ L and v combines the elements of the syntactic encoding features at the positions indicated by i and j. →y_k and ←y_k capture the bidirectional context information of the character at position k and are derived from the element y_k of the syntactic encoding features: →y_k represents the context information for y_k taken from the elements at even positions, and ←y_k represents the context information for y_k taken from the elements at odd positions. W_2, W_1, b_1 and b_2 are the neuron parameters to be trained in the component analysis module.
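As a sketch of how the span scoring of equations (1)-(2) could be implemented (the even/odd splitting convention, hidden size and label count are assumptions inferred from the surrounding text):

```python
import torch
import torch.nn as nn

class SpanScorer(nn.Module):
    """Span scoring in the spirit of equations (1)-(2): the span representation v
    is built from forward/backward halves of the encoder outputs y_k (even/odd
    feature positions) and passed through W_1, LayerNorm, ReLU and W_2."""
    def __init__(self, d_model=768, d_hidden=250, n_labels=64):
        super().__init__()
        self.W1 = nn.Linear(d_model, d_hidden)
        self.norm = nn.LayerNorm(d_hidden)
        self.W2 = nn.Linear(d_hidden, n_labels)

    def span_repr(self, y, i, j):
        # y: (L + 2, d_model); even feature positions -> forward half, odd -> backward
        fwd, bwd = y[:, 0::2], y[:, 1::2]
        return torch.cat([fwd[j] - fwd[i], bwd[i] - bwd[j]], dim=-1)   # eq. (2)

    def forward(self, y, i, j):
        v = self.span_repr(y, i, j)
        return self.W2(torch.relu(self.norm(self.W1(v))))              # eq. (1)
```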
For example, based on the set of component span scores, the component analysis module may further construct a component analysis tree T, depicted as the gray upper-triangular matrix in Fig. 3, to characterize the scores of the components over the individual spans. For example, the component analysis tree T may be represented by formula (3):

T := {(i_t, j_t; l_t) : t = 1, ..., |T|}.    (3)

Thus, the optimal component analysis tree T_best can be expressed by equation (4):

T_best = argmax_T Σ_{(i,j;l) ∈ T} s(i, j, l).    (4)

That is, by solving equation (4), the optimal label l for the span (i, j) can be found and the span (i, j) is cut into two subspans (i, m) and (m, j). The subspans (i, m) and (m, j) correspond to the two subcomponents of the component analysis tree T under the span (i, j). For example, the scores of the two subcomponents may be further represented by equation (5):

s_best(i, j) = max_l s(i, j, l) + max_{i<m<j} [ s_best(i, m) + s_best(m, j) ].    (5)

The labels in this document may be part-of-speech labels, word-segmentation boundary labels, word-segmentation position labels, etc., and the disclosure is not limited thereto.

For the i-th character in the sentence, it is not necessary to subdivide it further over the span (i-1, i); instead, the optimal label is determined directly with equation (6):

s_best(i-1, i) = max_l s(i-1, i, l).    (6)

Since the component analysis tree T in Fig. 3, depicted as the gray upper-triangular matrix, does not have a valid component for every span (i, j), an auxiliary empty label ∅ is introduced here to characterize spans whose score is 0. That is, a component having a span score of 0 in the component span score set can be represented by formula (7):

s(i, j, ∅) = 0.    (7)
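The dynamic-programming (CKY-style) decoding implied by equations (4)-(7) could look roughly like the following sketch; the `score(i, j)` interface returning a dict of label scores is an assumed placeholder:

```python
def best_parse(score, L):
    """Dynamic-programming decoding sketch for equations (4)-(7).
    `score(i, j)` is assumed to return a dict {label: span_score} for span (i, j),
    with the auxiliary empty label scoring 0 as in equation (7)."""
    best = {}   # (i, j) -> (best subtree score, chosen label, split point or None)

    for length in range(1, L + 1):
        for i in range(0, L - length + 1):
            j = i + length
            label, label_score = max(score(i, j).items(), key=lambda kv: kv[1])
            if length == 1:
                # equation (6): single characters are not subdivided further
                best[(i, j)] = (label_score, label, None)
            else:
                # equation (5): best split of (i, j) into (i, m) and (m, j)
                m = max(range(i + 1, j),
                        key=lambda s: best[(i, s)][0] + best[(s, j)][0])
                total = label_score + best[(i, m)][0] + best[(m, j)][0]
                best[(i, j)] = (total, label, m)

    return best[(0, L)]   # equation (4): score of the optimal component tree root
```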
Fig. 4A shows, as an example, a span score representation for the sentence "小朋友们好可爱啊" ("the kids are so cute"), which can be taken as a further example of the gray upper-triangular matrix in Fig. 3. Further, the span scores corresponding to some of the components in the set of component span scores s(i, j, ·) may be copied to the next layer for subsequent parsing by the dynamic programming decoder. These span scores may correspond to part-of-speech tags and word-segmentation boundary tags, respectively.
Thus, the extended syntax tree shown in Fig. 4B can be further obtained by using the dynamic programming decoder on the example in Fig. 4A. In the extended syntax tree, IP labels the grammatical features of the entire sentence, NP labels the grammatical features related to nouns, VP labels the grammatical features related to verbs, ADVP labels the grammatical features related to adverbs, and SP labels the grammatical features related to clauses. The extended syntax tree is an example of determining the predicted part-of-speech label, the predicted word-segmentation boundary label and the predicted word-segmentation position label corresponding to each segmented word in the grammar sample sentence based on the component span scores.
In the extended syntax tree shown in Fig. 4B, NN is the part of speech predicted by the component analysis module for "小朋友们" ("kids"), i.e. the word is predicted to be a noun. AD is the part of speech predicted by the component analysis module for "好" ("so"), i.e. the word is predicted to be an adverb. VA is the part of speech predicted by the component analysis module for "可爱" ("cute"), i.e. the word is predicted to be a predicative adjective. IJ is the part of speech predicted by the component analysis module for "啊", i.e. the word is predicted to be an interjection. B is the word-segmentation boundary and position predicted by the component analysis module for "小" and "可", i.e. these characters are predicted to be the beginning of a word. M is the word-segmentation boundary and position predicted by the component analysis module for the middle characters of "小朋友们", i.e. they are predicted to be the middle characters of a word. E is the word-segmentation boundary and position predicted by the component analysis module for "们" and "爱", i.e. these characters are predicted to be the end of a word. E is also the word-segmentation boundary and position predicted by the component analysis module for "啊", i.e. "啊" is predicted to be an individual word and/or to be located at the end of the sentence. Therefore, the dynamic programming decoder determines, based on the component span scores, the predicted part-of-speech label, the predicted word-segmentation boundary label and the predicted word-segmentation position label corresponding to each segmented word in the grammar sample sentence.
As described above, during the training of the first encoder in the parsing module, the dynamic programming decoder can output the predicted word-segmentation labels and the predicted part-of-speech labels corresponding to a sample sentence in the syntax tree data set. Then, the value of the first loss function can be calculated by comparing the predicted part-of-speech label, the predicted word-segmentation boundary label and the predicted word-segmentation position label corresponding to each segmented word in the sample sentence with the actual part-of-speech label, the actual word-segmentation boundary label and the actual word-segmentation position label corresponding to each segmented word in the grammar sample sentence. Thus, the parameters of the neurons in the first encoder and the component analysis module may be adjusted based on the value of the first loss function such that the first loss function converges.
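A sketch of what one training step over these labels might look like; the use of cross-entropy terms and the module interfaces are assumptions, since the patent only names the labels that enter the first loss function:

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()

def first_loss(pred_pos, pred_boundary, pred_position,
               gold_pos, gold_boundary, gold_position):
    # one cross-entropy term per label type, summed into the "first loss function"
    return (ce(pred_pos, gold_pos)
            + ce(pred_boundary, gold_boundary)
            + ce(pred_position, gold_position))

def train_step(batch, first_encoder, component_module, optimizer):
    # placeholder interfaces: the encoder maps sentence semantic features to
    # syntactic encoding features; the component module returns three per-word
    # logit tensors (part of speech, segmentation boundary, segmentation position)
    pos, boundary, position = component_module(first_encoder(batch["semantic"]))
    loss = first_loss(pos, boundary, position,
                      batch["pos"], batch["boundary"], batch["position"])
    optimizer.zero_grad()
    loss.backward()          # adjust neuron parameters until the loss converges
    optimizer.step()
    return loss.item()
```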
That is, the training of the first encoder is completed when the first loss function converges. Through this training process, the first encoder learns the part-of-speech information, word-segmentation boundary information and word-segmentation position information corresponding to each word in the sample sentences of the grammar tree data set. When facing the scenario shown in fig. 1A, the first encoder can predict the part-of-speech information, word-segmentation boundary information and word-segmentation position information of the sentence to be read aloud based on what it has learned from the sample sentences. The first encoder can likewise predict phrase-level information (e.g., how phrases are combined, and their components, boundaries and positions) and sentence-level information (e.g., the syntactic structure, boundaries and attention information of the sentence) of the sentence to be read aloud based on the phrase-level and sentence-level information learned from the sample sentences. The grammar coding features corresponding to the sentence are then further processed by the component analysis module, which combines them to output the grammatical features corresponding to the sentence.
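As a hedged illustration of the first loss function described above, the sketch below (PyTorch-style; the function name, the flattened per-character tensor shapes and the use of plain cross-entropy are assumptions, not details given in the disclosure) combines per-character losses for the predicted part-of-speech, word-segmentation boundary and word-segmentation position labels.

```python
import torch.nn.functional as F

def first_loss(pos_logits, boundary_logits, position_logits,
               pos_labels, boundary_labels, position_labels):
    """Hypothetical first loss: sum of per-character cross-entropies over the
    part-of-speech, segmentation-boundary and segmentation-position label
    sets derived from the component span scores."""
    return (F.cross_entropy(pos_logits, pos_labels)
            + F.cross_entropy(boundary_logits, boundary_labels)
            + F.cross_entropy(position_logits, position_labels))

# A training step would back-propagate this value through the first encoder
# and the component analysis module until the loss converges, e.g.:
# loss = first_loss(...); loss.backward(); optimizer.step()
```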
With continued reference to fig. 3, the grammatical features output by the component analysis module are input to the grammatical feature learning module. The grammatical feature learning module includes a shared hidden layer, a first convolutional neural network layer (shown as the first CNN) and a second convolutional neural network layer (shown as the second CNN). The shared hidden layer receives the grammatical features, further fuses them, and then feeds the fused grammatical features into the first and second convolutional neural network layers respectively. Optionally, the first and second convolutional neural network layers are independent 1-dimensional structures, each outputting a feature representation with the same length as the character sequence of the sentence.
As shown in fig. 3, the first CNN feeds the grammatical features for prosody into the speech generation module, and the second CNN feeds the grammatical features for polyphones into the polyphone analysis module. As described above, determining the polyphonic pronunciation of a character usually requires only information about the few characters before and after it rather than the entire sentence, so the length (span) over which the grammatical features for polyphones are computed may be smaller than that of the grammatical features for prosody. Thus, as one example, the first CNN, which generates the grammatical features for prosody, may use convolution kernels with spans of [3, 5, 7], and the second CNN, which generates the grammatical features for polyphones, may use convolution kernels with spans of [1, 3, 5]. Those skilled in the art will appreciate that the present disclosure is not limited thereto.
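A minimal PyTorch-style sketch of this grammatical feature learning module is given below. The hidden sizes, the ReLU non-linearity, the "same" padding and the summation of the multi-kernel branch outputs are assumptions made for illustration; only the two independent 1-D branches and the kernel spans [3, 5, 7] and [1, 3, 5] come from the description above.

```python
import torch
import torch.nn as nn

class GrammarFeatureLearning(nn.Module):
    """Shared hidden layer feeding two independent 1-D CNN branches."""

    def __init__(self, in_dim=256, hidden_dim=256, out_dim=256):
        super().__init__()
        self.shared = nn.Linear(in_dim, hidden_dim)          # shared hidden layer
        # padding=k//2 keeps each output as long as the character sequence
        self.prosody_cnn = nn.ModuleList(
            nn.Conv1d(hidden_dim, out_dim, k, padding=k // 2) for k in (3, 5, 7))
        self.polyphone_cnn = nn.ModuleList(
            nn.Conv1d(hidden_dim, out_dim, k, padding=k // 2) for k in (1, 3, 5))

    def forward(self, grammar_features):                     # (B, T, in_dim)
        h = torch.relu(self.shared(grammar_features)).transpose(1, 2)
        prosody = sum(conv(h) for conv in self.prosody_cnn).transpose(1, 2)
        polyphone = sum(conv(h) for conv in self.polyphone_cnn).transpose(1, 2)
        return prosody, polyphone                            # both (B, T, out_dim)
```

The wider kernels of the prosody branch give it a larger receptive field over the character sequence, matching the observation that prosody depends on more context than polyphone disambiguation.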
The polyphone analysis module is further described with continued reference to figs. 3 and 5. The polyphone analysis module comprises a splicer, a second encoder and a pinyin prediction layer. The splicer splices the character semantic features and the grammatical features for polyphones into initial polyphone features. The second encoder then determines the polyphone features based on the initial polyphone features and dictionary pronunciation information corresponding to each character in the character combination.
Optionally, the second encoder is a neural network model formed by stacking 2 identical converters (transformers). Each converter comprises a cascade of a multi-head attention layer, a first regularization layer, a feed-forward layer and a second regularization layer, and the output of the first regularization layer is also fed into the second regularization layer. The initial polyphone features can be input not only to the multi-head attention layer of the first converter but also to its first regularization layer, and the disclosure is not limited thereto.
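The sketch below illustrates one such converter block in PyTorch style. The feature dimension, head count, residual connections and ReLU feed-forward layer are assumptions for illustration; the cascade of multi-head attention, first regularization layer, feed-forward layer and second regularization layer, with the first layer's output also reaching the second, follows the description above.

```python
import torch.nn as nn

class Converter(nn.Module):
    """One transformer-style converter block of the second encoder."""

    def __init__(self, dim=256, heads=4, ff_dim=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)                 # first regularization layer
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(),
                                nn.Linear(ff_dim, dim))
        self.norm2 = nn.LayerNorm(dim)                 # second regularization layer

    def forward(self, x):                              # x: (batch, seq_len, dim)
        attn_out, _ = self.attn(x, x, x)
        h = self.norm1(x + attn_out)                   # output of the first norm ...
        return self.norm2(h + self.ff(h))              # ... also feeds the second norm

# The second encoder stacks two identical converters:
# second_encoder = nn.Sequential(Converter(), Converter())
```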
Optionally, the second encoder is trained from a polyphonic data set, the training of the second encoder comprising the following steps. Firstly, based on polyphonic sample sentences in the polyphonic data set, initial polyphonic characteristics corresponding to the polyphonic sample sentences are determined. And then, based on the initial polyphonic characteristics corresponding to the polyphonic sample sentences, determining the polyphonic characteristics corresponding to the polyphonic sample sentences by utilizing the second encoder. And then, based on the polyphone characteristics corresponding to the polyphone sample sentences, a pinyin prediction layer is utilized to determine the predicted pinyin labels corresponding to the characters in the polyphone sample sentences. Then, the value of a second loss function is calculated based on the predicted pinyin label corresponding to each character in the polyphonic sample sentence and the actual pinyin label corresponding to each character in the polyphonic sample sentence. And then, based on the value corresponding to the second loss function, adjusting the parameters of the second encoder and the neurons in the pinyin prediction layer so as to converge the second loss function.
Continuing with the above example, after the training of the grammar analysis module is completed, the trained semantic analysis module, grammatical feature learning module and splicer can be used to convert the polyphone sample sentences in the polyphone data set into initial polyphone features. The second encoder then predicts the polyphone features corresponding to the polyphone sample sentences based on the initial polyphone features and inputs them to the pinyin prediction layer. Referring to fig. 5, the second encoder outputs the polyphone features corresponding to each character in turn, in order of time step. The pinyin prediction layer determines the predicted pinyin label corresponding to each character from this information. For example, for the character glossed as "small", the pinyin prediction layer decodes the polyphone features as "xiao3", i.e. "xiao" in the third tone. Similarly, the pinyin of every other character in "the children are so lovely" is decoded in time-step order.
Then, in the training process of the polyphone analysis module, the value of the second loss function can be calculated based on the predicted pinyin label corresponding to each character in the polyphone sample sentence and the actual pinyin label corresponding to each character in the polyphone sample sentence. As shown in fig. 3, the actual pinyin label may be the pinyin information corresponding to each character in a dictionary. For example, if the pinyin of "small" in the dictionary is also labeled "xiao" in the third tone, the predicted result matches the actual result. The training of the polyphone analysis module is complete when the value of the second loss function converges over the samples in the polyphone data set.
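A hedged sketch of the pinyin prediction layer and the second loss function follows. The pinyin vocabulary size, the feature dimension and the use of plain cross-entropy are assumptions; the per-character classification into tone-marked pinyin such as "xiao3" follows the example above.

```python
import torch.nn as nn
import torch.nn.functional as F

class PinyinPredictionLayer(nn.Module):
    """Per-character classifier over a tone-marked pinyin vocabulary."""

    def __init__(self, dim=256, num_pinyin=1500):        # vocabulary size assumed
        super().__init__()
        self.proj = nn.Linear(dim, num_pinyin)

    def forward(self, polyphone_features):               # (batch, seq_len, dim)
        return self.proj(polyphone_features)              # pinyin logits per character

def second_loss(pinyin_logits, pinyin_labels):
    """Cross-entropy between predicted and dictionary-derived pinyin labels."""
    return F.cross_entropy(pinyin_logits.reshape(-1, pinyin_logits.size(-1)),
                           pinyin_labels.reshape(-1))
```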
Through the training process, the second encoder learns dictionary pronunciation information corresponding to each character in the sample sentence in the polyphonic character data set. When facing the scenario shown in fig. 1A, the second encoder may predict dictionary pronunciation information of each character of the sentence to be read aloud based on dictionary pronunciation information or the like learned in the sample sentence, and further determine the polyphonic feature corresponding to the sentence.
With continued reference to fig. 3, the polyphone analysis module inputs the polyphone features to the tonal analysis module. The tonal analysis module comprises a cascaded third convolutional neural network, an unshared output layer and a softmax layer. The softmax layer outputs the polyphone features fused with tonal information.
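The sketch below illustrates this cascade. Reading the "unshared output layer" as one output projection per time step is an interpretation made here, and the kernel size, feature dimension, maximum sequence length and number of output classes are likewise assumptions rather than details from the disclosure.

```python
import torch
import torch.nn as nn

class TonalAnalysis(nn.Module):
    """Third CNN -> unshared (per-time-step) output layers -> softmax."""

    def __init__(self, dim=256, max_len=128, num_classes=256):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)    # third CNN
        # One linear projection per time step ("unshared"); max_len is assumed.
        self.out = nn.ModuleList(nn.Linear(dim, num_classes) for _ in range(max_len))

    def forward(self, polyphone_features):                 # (batch, seq_len, dim)
        h = torch.relu(self.conv(polyphone_features.transpose(1, 2))).transpose(1, 2)
        logits = torch.stack([self.out[t](h[:, t]) for t in range(h.size(1))], dim=1)
        return torch.softmax(logits, dim=-1)                # tone-fused polyphone features
```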
The polyphone features fused with tonal information are determined by the tonal analysis module, which is trained from a text-to-speech data set. The training of the tonal analysis module comprises the following steps. First, based on the speech sample sentences in the text-to-speech data set, the tonal analysis module is used to determine the predicted polyphone features corresponding to the speech sample sentences. Then, a value of a third loss function is calculated based on the predicted polyphone features corresponding to a speech sample sentence and the actual polyphone features corresponding to that speech sample sentence. Finally, the neuron parameters in the tonal analysis module are adjusted based on the value of the third loss function so that the third loss function converges.
Through this training process, the tonal analysis module learns the tonal information with which the sample sentences in the text-to-speech data set are actually read aloud. When facing the scenario shown in fig. 1A, the tonal analysis module can predict the tonal information of each character of the sentence to be read aloud based on the tonal information learned from the sample sentences, and further determine the polyphone features fused with tonal information corresponding to the sentence.
The tonal analysis module further fuses and adjusts the polyphone features through the third convolutional neural network, the unshared output layer and the softmax layer to obtain the polyphone features fused with tonal information, and it scores the polyphone features fused with tonal information against actual text-to-speech data, so that the resulting pinyin sequence is more accurate.
With continued reference to fig. 3, the tonal analysis module inputs the polyphone features fused with tonal information to the speech generation module. The speech generation module comprises a pinyin-to-phoneme module, a phoneme splicing module, an audio conversion module and a vocoder. Specifically, the pinyin-to-phoneme module may take the polyphone features fused with tonal information as input, or may directly take the polyphone features determined by the polyphone analysis module as input, and converts them into an initial phoneme sequence.
The phoneme splicing module can then splice the grammatical features for prosody and the initial phoneme sequence to obtain the phoneme sequence corresponding to the sentence. Since this phoneme sequence carries not only the phoneme information of the sentence but also its prosody information, a separate audio conversion module needs to be trained to decode the phoneme sequence into audio data. Optionally, the audio conversion module may include a third encoder, an attention layer and a decoder.
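A minimal sketch of the splicing step is shown below. The phoneme inventory size, the embedding of the initial phoneme sequence and the assumption that both inputs are already aligned to the same length are illustrative choices, not details given in the disclosure.

```python
import torch
import torch.nn as nn

phoneme_embedding = nn.Embedding(100, 64)     # phoneme inventory size assumed

def splice(prosody_features, initial_phoneme_ids):
    """Concatenate prosodic grammatical features with embedded phonemes.

    prosody_features: (batch, T, D); initial_phoneme_ids: (batch, T) int64.
    How character-level features are expanded to phoneme length is not
    specified in the source, so equal lengths are simply assumed here.
    """
    phoneme_vecs = phoneme_embedding(initial_phoneme_ids)   # (batch, T, 64)
    return torch.cat([prosody_features, phoneme_vecs], dim=-1)
```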
Optionally, the speech generation module is trained from the text-to-speech data set, the training of the speech generation module comprising the following steps. Firstly, based on the speech sample sentence in the text-to-speech data set, the speech generation module is utilized to predict the corresponding audio of the speech sample sentence. Then, a value of a fourth loss function is calculated based on the predicted audio corresponding to the speech sample sentence and the actual audio corresponding to the speech sample sentence. Finally, neuron parameters in the speech generation module (e.g., neuron parameters in an audio conversion module) are adjusted based on the value of the fourth loss function such that the fourth loss function converges.
Through this training process, the speech generation module learns the speech information with which the sample sentences in the text-to-speech data set are actually read aloud. When facing the scenario shown in fig. 1A, the speech generation module can predict the speech information of each character of a sentence to be read aloud based on the speech information learned from the sample sentences, and further determine the speech corresponding to the sentence.
As described above, during the training process each module can be trained independently, so that gradients do not propagate back across module boundaries (i.e. the gradients are stopped between modules), thereby avoiding interference between the modules.
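One simple way to realize this independent, gradient-blocked training is sketched below; the generic cascade function is an assumption used only to illustrate how detaching each sub-module's output stops gradients from crossing module boundaries.

```python
def cascade_forward(modules, x):
    """Run a cascade of sub-modules while blocking gradients between them.

    `modules` is any sequence of callables (e.g. torch.nn.Module objects);
    detach() prevents back-propagation from the module currently being
    trained into the already-trained modules before it.
    """
    for module in modules[:-1]:
        x = module(x).detach()        # stop the gradient at the module boundary
    return modules[-1](x)
```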
As shown in figs. 3-5, the acoustic model of the present disclosure is composed of a plurality of cascaded sub-modules. For example, these sub-modules may respectively handle span-based word segmentation and part-of-speech tagging, grammar-tree-based linguistic feature learning, attention-based polyphone disambiguation, tone-change prediction, and speech generation. By cascading the sub-modules, the information learned by each sub-module is fused and shared among them, which improves performance and robustness. In addition, the acoustic model of the present disclosure can improve the naturalness of the synthesized speech based on component analysis. For example, the present disclosure extracts highly prosody-related grammatical features from the component analysis tree and uses them as an additional input for TTS, thereby avoiding training a separate prosodic structure prediction module while improving performance and robustness.
In addition, the present disclosure also provides an apparatus for converting text data into a phoneme sequence. Fig. 6 is a block diagram illustrating an apparatus 600 for converting text data into a phoneme sequence according to an embodiment of the present disclosure. The apparatus 600 includes an extraction unit, a first determination unit, a second determination unit and a third determination unit. The extraction unit is configured to extract, based on a sentence in the text data, sentence semantic features corresponding to the sentence and character semantic features corresponding to one or more consecutive characters in the sentence. The first determination unit is configured to determine the grammatical features corresponding to the sentence based on the sentence semantic features corresponding to the sentence. The second determination unit is configured to determine polyphone features based on the character semantic features and the grammatical features corresponding to the sentence, the polyphone features indicating the polyphonic pronunciation information of the characters. The third determination unit is configured to determine the phoneme sequence corresponding to the sentence based on the grammatical features and the polyphone features.
Further, the second determination unit is further configured to: determine grammatical features for polyphones based on the grammatical features corresponding to the sentence, and determine the polyphone features based on the grammatical features for polyphones and the character semantic features.
The third determination unit is further configured to: determine grammatical features for prosody based on the grammatical features corresponding to the sentence, and determine the phoneme sequence corresponding to the sentence based on the grammatical features for prosody and the polyphone features.
The extraction unit described above may be similar to the sentence semantic analysis module described above, the first determination unit may be similar to the grammar analysis module described above (or a combination of the grammar analysis module and the grammatical feature learning module), the second determination unit may be similar to the polyphone analysis module described above (or a combination of the polyphone analysis module and the tonal analysis module), and the third determination unit may be similar to the speech generation module described above. For the sake of brevity, details are not repeated here.
Fig. 7 is a block diagram illustrating an apparatus 700 for converting text data into a phoneme sequence according to an embodiment of the present disclosure.
Referring to fig. 7, a device 700 may include a processor 701 and a memory 702. The processor 701 and the memory 702 may be connected by a bus 703.
The processor 701 may perform various actions and processes according to programs stored in the memory 702. In particular, the processor 701 may be an integrated circuit chip having signal processing capabilities. The processor may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and it may implement or perform the various methods, steps and logic blocks disclosed in the embodiments of the present application. The general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like, and may for example be of the X86 or ARM architecture.
The memory 702 has stored thereon computer instructions that, when executed by the processor 701, implement the method 200. The memory 702 may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM) or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM) and direct Rambus random access memory (DR RAM). It should be noted that the memories of the methods described in this disclosure are intended to comprise, without being limited to, these and any other suitable types of memory.
Embodiments of the present disclosure provide a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above-described method.
According to another aspect of the present disclosure, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the above aspects or various alternative implementations of the above aspects.
It is to be noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In general, the various example embodiments of this disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic or any combination thereof. Certain aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of embodiments of the disclosure have been illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The exemplary embodiments of the invention, as set forth in detail above, are intended to be illustrative, not limiting. It will be appreciated by those skilled in the art that various modifications and combinations of the embodiments or features thereof may be made without departing from the principles and spirit of the invention, and that such modifications are intended to be within the scope of the invention.

Claims (15)

1. A method of converting text data into a sequence of phonemes, comprising:
extracting sentence semantic features corresponding to the sentences and character semantic features corresponding to one or more continuous characters in the sentences based on the sentences in the text data;
determining grammatical features corresponding to the sentence based on the sentence semantic features corresponding to the sentence;
determining polyphonic features based on the character semantic features and grammatical features corresponding to the sentence, the polyphonic features indicating polyphonic pronunciation information for the character, and
determining a phoneme sequence corresponding to the sentence based on the grammatical features and the polyphonic features.
2. The method of claim 1, wherein said determining polyphonic features based on the character semantic features and the corresponding grammatical features of the sentence further comprises:
determining grammatical features for polyphonic pronunciations based on the grammatical features corresponding to the sentence, and
determining polyphonic features based on the grammatical features for polyphonic pronunciations and the character semantic features.
3. The method of claim 2, wherein said determining a phoneme sequence corresponding to the sentence based on the grammatical features and the polyphonic features further comprises:
determining grammatical features for prosody based on the grammatical features corresponding to the sentence,
and determining a phoneme sequence corresponding to the sentence based on the grammatical features for prosody and the polyphonic features.
4. The method of claim 3, wherein said determining a phoneme sequence corresponding to the sentence based on the grammatical features and the polyphonic features further comprises:
determining polyphonic features fused with tonal information based on the polyphonic features and tonal information;
and determining a phoneme sequence corresponding to the sentence based on the polyphonic features fused with tonal information and the grammatical features for prosody.
5. The method of claim 1, wherein said determining grammatical features corresponding to the sentence based on the sentence semantic features corresponding to the sentence further comprises:
determining grammar coding features corresponding to the sentence based on the sentence semantic features corresponding to the sentence and part-of-speech information, word-segmentation boundary information and word-segmentation position information corresponding to each word in the sentence;
and determining the grammatical features corresponding to the sentence based on the grammar coding features corresponding to the sentence.
6. The method of claim 2, wherein said determining polyphonic features based on the grammatical features for polyphonic pronunciations and the character semantic features further comprises:
concatenating the grammatical features for polyphonic pronunciation and the character semantic features into an initial polyphonic feature, and
determining the polyphonic features based on the initial polyphonic feature and dictionary pronunciation information corresponding to each character in the character combination.
7. The method of claim 5, wherein the grammar coding features corresponding to the sentence are determined by a first encoder based on the sentence semantic features, and the grammatical features corresponding to the sentence are determined by a component analysis module based on the grammar coding features, the first encoder and the component analysis module being trained from a grammar tree data set, the training of the first encoder comprising:
determining, with the first encoder, the grammar coding features corresponding to a grammar sample sentence in the grammar tree data set based on the grammar sample sentence,
determining the grammatical features corresponding to the grammar sample sentence and extracting component span scores from the grammar coding features corresponding to the grammar sample sentence by using the component analysis module;
determining a predicted part-of-speech label, a predicted part-of-speech boundary label and a predicted part-of-speech position label corresponding to each part-word in the grammar sample sentence based on the component span scores;
calculating a value corresponding to a first loss function based on a predicted part-of-speech label, a predicted part-of-speech boundary label and a predicted part-of-speech position label corresponding to each part-word in the grammar sample sentence, and an actual part-of-speech label, an actual part-of-speech boundary label and an actual part-of-speech position label corresponding to each part-word in the grammar sample sentence; and
and adjusting parameters of the neurons in the first encoder and the component analysis module based on the value corresponding to the first loss function so as to converge the first loss function.
8. The method of claim 6, wherein the polyphonic features are determined based on the initial polyphonic features by a second encoder, the second encoder being trained by a polyphonic dataset, the training of the second encoder comprising:
determining initial polyphone characteristics corresponding to polyphone sample sentences based on polyphone sample sentences in the polyphone data set;
determining polyphonic features corresponding to the polyphonic sample sentences by using the second encoder based on initial polyphonic features corresponding to the polyphonic sample sentences;
determining, with a pinyin prediction layer, the predicted pinyin labels corresponding to the characters in the polyphonic sample sentence based on the polyphonic features corresponding to the polyphonic sample sentence;
calculating a value of a second loss function based on the predicted pinyin label corresponding to each character in the polyphonic sample sentence and the actual pinyin label corresponding to each character in the polyphonic sample sentence; and
and adjusting parameters of the second encoder and the neurons in the Pinyin prediction layer based on the value corresponding to the second loss function so as to converge the second loss function.
9. The method of claim 4, wherein the polyphonic features fused with tonal information are determined by a tonal analysis module, the tonal analysis module being trained from a text-to-speech data set, the training of the tonal analysis module comprising:
determining, with the tonal analysis module, predicted polyphonic features corresponding to a speech sample sentence in the text-to-speech data set based on the speech sample sentence,
calculating a value of a third loss function based on the predicted polyphonic features corresponding to the speech sample sentence and the actual polyphonic features corresponding to the speech sample sentence; and
adjusting neuron parameters in the tonal analysis module based on the value of the third loss function such that the third loss function converges.
10. The method of claim 9, further comprising: determining, with a speech generation module trained from the text-to-speech data set, audio corresponding to the sentence based on the phoneme sequence corresponding to the sentence, the training of the speech generation module comprising:
based on a speech sample sentence in the text-to-speech dataset, utilizing the speech generation module to generate a corresponding predicted audio for the speech sample sentence,
calculating a value of a fourth loss function based on the predicted audio corresponding to the speech sample sentence and the actual audio corresponding to the speech sample sentence; and
based on the value of the fourth loss function, neuron parameters in the speech generation module are adjusted such that the fourth loss function converges.
11. The method of claim 3, wherein the grammatical features for prosody are determined by a shared hidden layer and a first convolutional neural network layer, the grammatical features for polyphones are determined by the shared hidden layer and a second convolutional neural network layer, and a span of the convolution kernels of the first convolutional neural network layer is greater than a span of the convolution kernels of the second convolutional neural network layer.
12. An apparatus for converting text data into a sequence of phonemes, comprising:
an extraction unit configured to extract, based on a sentence in the text data, a sentence semantic feature corresponding to the sentence and a character semantic feature corresponding to one or more consecutive characters in the sentence,
a first determination unit configured to determine grammatical features corresponding to the sentence based on the sentence semantic features corresponding to the sentence,
a second determination unit configured to determine polyphonic features based on the character semantic features and the grammatical features corresponding to the sentence, the polyphonic features indicating polyphonic pronunciation information of the characters, and
a third determination unit configured to determine a phoneme sequence corresponding to the sentence based on the grammatical features and the polyphonic features.
13. The apparatus of claim 12, wherein,
the second determination unit is further configured to:
determine grammatical features for polyphonic pronunciations based on the grammatical features corresponding to the sentence, and
determine polyphonic features based on the grammatical features for polyphonic pronunciations and the character semantic features;
the third determination unit is further configured to:
determine grammatical features for prosody based on the grammatical features corresponding to the sentence,
and determine a phoneme sequence corresponding to the sentence based on the grammatical features for prosody and the polyphonic features.
14. An apparatus for converting text data into a sequence of phonemes, comprising:
a processor; and
memory, wherein the memory has stored therein a computer-executable program that, when executed by the processor, performs the method of any of claims 1-11.
15. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method of any one of claims 1-11.
CN202110832833.4A 2021-07-22 2021-07-22 Method and device for converting text data into phoneme sequence Pending CN113823259A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110832833.4A CN113823259A (en) 2021-07-22 2021-07-22 Method and device for converting text data into phoneme sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110832833.4A CN113823259A (en) 2021-07-22 2021-07-22 Method and device for converting text data into phoneme sequence

Publications (1)

Publication Number Publication Date
CN113823259A true CN113823259A (en) 2021-12-21

Family

ID=78912756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110832833.4A Pending CN113823259A (en) 2021-07-22 2021-07-22 Method and device for converting text data into phoneme sequence

Country Status (1)

Country Link
CN (1) CN113823259A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0831460A2 (en) * 1996-09-24 1998-03-25 Nippon Telegraph And Telephone Corporation Speech synthesis method utilizing auxiliary information
US8510113B1 (en) * 2006-08-31 2013-08-13 At&T Intellectual Property Ii, L.P. Method and system for enhancing a speech database
JP2017208097A (en) * 2016-05-20 2017-11-24 富士通株式会社 Ambiguity avoidance method of polyphonic entity and ambiguity avoidance device of polyphonic entity
JP2020034883A (en) * 2018-08-27 2020-03-05 日本放送協会 Voice synthesizer and program
CN110782870A (en) * 2019-09-06 2020-02-11 腾讯科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112528648A (en) * 2020-12-10 2021-03-19 平安科技(深圳)有限公司 Method, device, equipment and storage medium for predicting polyphone pronunciation

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ZHIYONG WU: "controllable emphatic speech synthesis based on forward attention for expressive speech synthesis", IEEE, 25 March 2021 (2021-03-25) *
ZHIYONG WU: "disambiguation of chinese polyphones in and end to end framework with semantic features extracted by pre-trained bert", INTERSPEECH 2019, 19 September 2019 (2019-09-19) *
王国梁等: "一种基于Tacotron 2的端到端中文语音合成方案", 华东师范大学学报(自然科学版), no. 4, 25 July 2019 (2019-07-25) *
郝东亮;杨鸿武;张策;张帅;郭立钊;杨静波;: "面向汉语统计参数语音合成的标注生成方法", 计算机工程与应用, no. 19, 1 October 2016 (2016-10-01) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114999450A (en) * 2022-05-24 2022-09-02 网易有道信息技术(北京)有限公司 Homomorphic and heteromorphic word recognition method and device, electronic equipment and storage medium
CN115329785A (en) * 2022-10-15 2022-11-11 小语智能信息科技(云南)有限公司 Phoneme feature-fused English-Tai-old multi-language neural machine translation method and device
CN115329785B (en) * 2022-10-15 2023-01-20 小语智能信息科技(云南)有限公司 English-Tai-old multi-language neural machine translation method and device integrated with phoneme characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination