CN113345409A - Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium - Google Patents


Info

Publication number: CN113345409A (granted as CN113345409B)
Application number: CN202110893747.4A
Authority: CN (China)
Prior art keywords: text information, converted, symbol, recognized, converting
Inventors: 智鹏鹏, 陈帅婷, 陈昌滨
Applicant/Assignee: Beijing Century TAL Education Technology Co Ltd
Other languages: Chinese (zh)
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L2013/083: Special characters, e.g. punctuation marks

Abstract

The present disclosure provides a speech synthesis method, apparatus, electronic device, and computer-readable storage medium. The method comprises: acquiring text information to be converted, the text information to be converted comprising a symbol to be recognized; acquiring a preset regular matching rule; converting the symbol to be recognized into text information according to the preset regular matching rule; converting the text information to be converted into complete text information according to the text information corresponding to the symbol to be recognized; and performing speech synthesis on the complete text information to generate audio information.

Description

Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
Technical Field
The present invention relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method and apparatus, an electronic device, and a computer-readable storage medium.
Background
With the application of deep learning, speech synthesis technology is developing rapidly. Speech synthesis identifies the content of a text and converts the text content to be synthesized into speech; end-to-end speech synthesis can produce speech of high naturalness that can be applied to many scenarios and expressed clearly and completely in them. Users can listen directly to the information they need on various occasions without reading the text.
However, conventional speech synthesis technology provides no speech synthesis scheme for the question types found in education scenarios. For example, the text of a question to be synthesized may be interspersed with blank-information symbols such as "_", "()", or "？" (the last rendered only as an image in the original publication), and directly feeding text containing such symbols into speech synthesis yields a result that cannot completely express the original question.
Disclosure of Invention
According to an aspect of the present disclosure, there is provided a speech synthesis method including:
acquiring text information to be converted; the text information to be converted comprises a symbol to be identified;
acquiring a preset regular matching rule;
converting the symbol to be recognized into text information according to the preset regular matching rule;
converting the text information to be converted into complete text information according to the text information corresponding to the symbol to be recognized;
and carrying out voice synthesis on the complete text information to generate audio information.
According to another aspect of the present disclosure, there is provided a speech synthesis apparatus including:
the first acquisition module is used for acquiring the text information to be converted, wherein the text information to be converted comprises a symbol to be identified;
the second acquisition module is used for acquiring the preset regular matching rule;
the first conversion module is used for converting the symbol to be recognized into text information according to the preset regular matching rule;
the second conversion module is used for converting the text information to be converted into complete text information according to the text information corresponding to the symbol to be recognized; and the number of the first and second groups,
and the voice synthesis module is used for carrying out voice synthesis on the complete text information to generate audio information.
According to another aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory for storing a program, wherein the program is stored in the memory,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the speech synthesis method of any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the speech synthesis method of any one of the embodiments of the present disclosure.
By means of this speech synthesis method, speech synthesis of education-scenario questions can be achieved, and the accuracy and completeness of the meaning of the generated audio are guaranteed.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
Further details, features and advantages of the disclosure are disclosed in the following description of exemplary embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 illustrates a flow diagram of a method of speech synthesis according to some embodiments of the present disclosure;
FIG. 2 shows a schematic diagram of an encoder and decoder architecture of the present disclosure;
FIG. 3 shows a flow diagram of a method of speech synthesis according to further embodiments of the present disclosure;
FIG. 4 is a flowchart illustrating determining and converting output of symbols in text information to be synthesized according to an exemplary embodiment of the present disclosure;
FIG. 5 illustrates a flow diagram of a symbol conversion process in accordance with some embodiments of the present disclosure;
FIG. 6 illustrates a schematic structural diagram of a speech synthesis apparatus according to some embodiments of the present disclosure;
FIG. 7 illustrates a flow diagram of a method of speech synthesis according to some embodiments of the present disclosure;
FIG. 8 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description. It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
End-to-end speech synthesis can synthesize speech of high naturalness and is applied in many scenarios. At present, however, in education scenarios, the text to undergo speech synthesis is interspersed with blank-information symbols such as "()", "_", or "？". Directly feeding such text into synthesis causes omitted information, gaps, incomplete content, semantic errors, and similar problems, so the original question cannot be expressed completely. Manually escaping the text to be synthesized costs considerable labor and time, defeating the labor-saving purpose of speech synthesis technology.
This embodiment provides a speech synthesis method that may be used in a smart device such as a mobile phone or a tablet computer. FIG. 1 shows a flowchart of a speech synthesis method 100 according to an exemplary embodiment of the present disclosure; as shown in FIG. 1, the speech synthesis method 100 includes the following steps:
in step S110, text information to be converted is acquired.
The text information to be converted may be acquired, for example, by photographing the text with a camera device and importing the photograph, or by directly inputting an electronic version of the text information to be converted into the client.
In some embodiments, the text information to be converted may be a Chinese question from an education scenario. Illustratively, the Chinese question types of the education scenario may be multiple-choice questions, fill-in-the-blank questions, short-answer questions, and the like. In some examples, the Chinese question text of the education scenario contains a symbol to be recognized, i.e., a symbol that is difficult for existing speech synthesis technology to synthesize accurately; for example, the symbol to be recognized may be "？", "_", "()", and so on. For example, suppose the text information to be converted is "Children, carefully observe the rule of the numbers; the number at ？ is ( )". With existing speech synthesis technology, the symbols are read incorrectly and the question loses its meaning, whereas the correct synthesis result of this text is "Children, carefully observe the rule of the numbers; what is the number at the question mark?".
In some embodiments, it may first be judged whether the text information to be converted contains "？" or blank symbols such as "()" and "_". When the text to be converted contains such symbols, step S120 is executed; when it does not, step S120 and step S130 may be skipped and step S140 executed directly. A blank symbol is a symbol used to express specific semantics in the text to be converted; illustratively, in an education scenario, correct information can be filled in at the position of the blank symbol so that the meaning expressed by the text to be converted is complete.
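The check described above can be sketched as a simple scan for the blank symbols. This is a minimal illustrative sketch, not the patent's implementation; "？" is used as a stand-in for the blank mark that appears only as an image in the original publication.

```python
import re

# Assumed blank-symbol inventory: "_", "()" (possibly with spaces), and
# the "？" placeholder mark. These are illustrative stand-ins.
BLANK_SYMBOL = re.compile(r"[_？]|\(\s*\)")

def contains_blank_symbol(text: str) -> bool:
    """True if the text to be converted contains a symbol to be recognized."""
    return BLANK_SYMBOL.search(text) is not None
```

When this check returns False, the conversion steps can be skipped and the text sent directly to synthesis.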
In step S120, a preset regular matching rule is acquired.
The regular matching rule can convert the symbol to be recognized according to the position of the symbol to be recognized in the text information to be converted. In some embodiments, the regular matching rule may also convert the symbol to be recognized according to the characters in the text information to be converted.
In some embodiments, the regular matching rule may be input by the user according to the application scenario, in addition to being preset in advance.
In step S130, the symbol to be recognized is converted into text information according to a preset regular matching rule.
In some embodiments, the symbol to be recognized in the text information to be converted is converted according to a preset regular matching rule. Illustratively, the matched symbol to be recognized in the text to be converted can be replaced by the result corresponding to the regular matching rule.
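Step S130 can be sketched as an ordered table of pattern/replacement pairs, where each matched symbol is replaced by the result the rule maps it to. The two rules below are illustrative stand-ins (again using "？" for the image-only blank mark), not the patent's actual rule set.

```python
import re

# Illustrative rule table: a "？" quoted alone is read out as the words
# "question mark"; a lone "？" elsewhere is dropped. Order matters.
RULES = [
    (re.compile(r'"？"'), "question mark"),
    (re.compile(r"？"), ""),
]

def convert_symbols(text: str) -> str:
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Each rule is applied over the whole text before the next, mirroring the sequential matching passes of the method.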
In step S140, the text information to be converted is converted into complete text information according to the text information corresponding to the symbol to be recognized.
The complete text information does not contain the symbol to be recognized, and can be used for performing speech synthesis with complete and accurate meaning.
In step S150, the complete text information is speech-synthesized to generate audio information.
For example, the complete text information may be converted into phonetic notation and then input into an encoder; a decoder decodes the output of the encoder, and the decoder's generated result is converted into audio output.
In some embodiments, grapheme-to-phoneme conversion may be performed on the complete text information to generate a phoneme sequence. The phoneme sequence is input into an encoder, which may use 3 one-dimensional convolutional layers (kernel size 5, 512 units) and 1 bidirectional Long Short-Term Memory (BLSTM) layer of 256 units; the character embedding and the reference embedding output by a reference encoder are added and fed into the BLSTM to generate intermediate hidden variables.
In some embodiments, the decoder decodes the encoder output to obtain a mel spectrogram. In some examples, the decoder may be divided into a pre-net, an Attention Recurrent Neural Network (Attention-RNN), and a Decoder Recurrent Neural Network (Decoder-RNN). Illustratively, the Decoder-RNN is a two-layer residual Gated Recurrent Unit network (residual GRU) containing 256 GRUs per layer, and the output of the Decoder-RNN is the sum of the input and the output passing through the GRU units. Illustratively, the encoder and decoder may form a sequence-to-sequence (seq2seq) architecture; the encoder-decoder architecture of this embodiment is shown in FIG. 2.
In some embodiments, the decoder decodes the encoder output through an attention mechanism. Illustratively, the attention structure may be one RNN layer comprising 128 GRUs, whose inputs are the outputs of the pre-net and the Attention-RNN. Illustratively, the attention mechanism uses location-sensitive attention to obtain alignment features.
In some embodiments, the attention transition mechanism recursively calculates a modified attention probability for each time step using a forward algorithm, allowing the attention mechanism to make a move-forward or dwell decision at each decoder time step.
Illustratively, the spectrogram generated by the decoder may be converted to audio via the Griffin-Lim algorithm or a neural vocoder.
In this embodiment, a text analysis step is introduced before the speech synthesis front end: complete text information is obtained by judging the symbols to be recognized in the text to be converted and translating them through regular-expression matching. Performing speech synthesis on the converted complete text greatly improves synthesis quality, ensures the accuracy of speech synthesis for Chinese question types in education scenarios in particular, guarantees the completeness of the meaning of the synthesis result, broadens the application range of speech synthesis technology, saves users the time of reading text to obtain information, and reduces the cost of manually annotating Chinese question types in education scenarios for speech synthesis.
FIG. 3 illustrates a flow diagram of a method 300 of speech synthesis according to further embodiments of the present disclosure. As shown in fig. 3, the method 300 includes the steps of:
in step S302, text information to be converted is acquired.
In step S304, a preset regular matching rule is acquired.
The embodiments of step S302 and step S304 are already explained in step S110 and step S120, and are not repeated herein.
In step S306, "？" in the text information to be converted is matched according to the preset regular matching rule. The implementation of this embodiment is illustrated in FIG. 4. Before regular matching, it is detected whether the text to be converted includes a symbol to be recognized such as "？"; when no symbol to be recognized is included, steps S306, S308, and S310 may be skipped and step S312 performed directly.
In some embodiments, "？" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when "？" stands in the middle of a pair of double quotation marks, and no characters and/or symbols other than "？" lie between the quotation marks, "？" is converted into the words "question mark". Illustratively, when "？" appears alone in the text to be converted, it is converted to null; a "？" appearing alone is not translated. Illustratively, the text to be converted "Can you find the rule of the figure change？ According to the rule, what should "？" be ( )" is converted into "Can you find the rule of the figure change; according to the rule, what should the question mark be".
In step S308, "_" in the text information to be converted is matched according to the preset regular matching rule. After matching is completed, it is checked whether any "_" remains in the text to be converted; if an unmatched "_" exists, the text to be converted is classified by the deep neural network and the unmatched "_" is converted.
In some embodiments, "_" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when the underscore serves as a connector symbol in the text to be converted, "_" is converted to null; a connector "_" is not translated. In some examples, the specific form of "_" as a connector may be "first part _ second part", where neither part is a symbol or a space; for example, "A_1", "A_2", and so on. Illustratively, the text information to be converted "four factories A_1, A_2, A_3, A_4 on one highway" is converted into "four factories A1, A2, A3, A4 on one highway".
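The connector rule above can be sketched with a lookbehind/lookahead pattern: an underscore directly joining two alphanumeric characters is treated as a connector and dropped rather than read aloud. This is an illustrative sketch, not the patent's actual expression.

```python
import re

# Connector "_": directly flanked by alphanumerics on both sides
# (e.g. "A_1"). Underscores next to spaces or punctuation are left
# alone so that genuine blank marks survive for later passes.
CONNECTOR = re.compile(r"(?<=[A-Za-z0-9])_(?=[A-Za-z0-9])")

def drop_connectors(text: str) -> str:
    return CONNECTOR.sub("", text)
```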
In some embodiments, "_" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when the characters "horizontal line" are included in the text to be converted, "_" is converted into the words "horizontal line". Illustratively, the text to be converted "Fill the appropriate number on the horizontal line below according to the rule. 1, 5, 9, 13, _, 21, 25" is converted into "Fill the appropriate number on the horizontal line below according to the rule. 1, 5, 9, 13, horizontal line, 21, 25".
In some embodiments, when converting "_" in the text information through the deep neural network, a [Mask] token may first replace "_", and the mask is later replaced with the conversion result. The [Mask] token is a special mask symbol (shown only as an image in the original publication).
In some embodiments, the text information to be converted may be fed into a trained deep neural network and classified. Illustratively, a Bidirectional Encoder Representations from Transformers (BERT) model may acquire, in parallel through an attention mechanism, word representations containing the context of the text to be converted; in some examples, the context information may be extracted using an 8-head self-attention mechanism. Illustratively, the information obtained by the BERT model may be expressed as: Encoder_bert = BERT(text).
In some embodiments, the core of the deep neural network model may be a transformer-like encoder. For example, the class with the highest softmax probability may be selected as the classification result, and after the classification information is determined, the text to be converted is labeled with the corresponding tag. In some examples, the classification information may include 4 classes: time, orientation, quantity, and thing.
Illustratively, when the classification information is time, "_" is converted into "when"; when the classification information is orientation, "_" is converted into "which side"; when the classification information is quantity, "_" is converted into "how many". When the classification information is thing, it is judged whether the text to be converted contains a subject: when it does not, "_" is converted into "what"; when it does, it is further judged whether the subject is animate, converting "_" into "what" for an inanimate subject and into "who" for an animate subject. Finally, the mask is replaced with the conversion result.
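The class-to-question-word mapping above can be sketched as follows. The English question words are glosses of the translated text (the real system substitutes Chinese interrogative pronouns), and the subject/animacy checks are simplified to boolean flags.

```python
# Map a predicted blank-symbol category to the question word that
# replaces the blank. Category names follow the four classes in the
# text: time, orientation, quantity, thing.
def blank_replacement(category: str, has_subject: bool = True,
                      animate_subject: bool = False) -> str:
    if category == "time":
        return "when"
    if category == "orientation":
        return "which side"
    if category == "quantity":
        return "how many"
    if category == "thing":
        # No subject or an inanimate subject -> "what"; animate -> "who".
        return "who" if has_subject and animate_subject else "what"
    raise ValueError(f"unknown category: {category}")
```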
In step S310, "()" in the text information to be converted is matched according to the preset regular matching rule. After matching is completed, it is checked whether any "()" remains in the text to be converted; if an unmatched "()" exists, the text to be converted is classified by the deep neural network and the unmatched "()" is converted.
In some embodiments, "()" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when the symbol to be recognized includes "()" with no space between the left and right parentheses, "()" is converted into the word "parentheses".
In some embodiments, "( )" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when "( )" appears at the beginning of a sentence with a blank space between the parentheses, "( )" is converted into "who". Illustratively, the text to be converted "( )'s two ends can be extended indefinitely, much like a golden cudgel." is converted into "Whose two ends can be extended indefinitely, much like a golden cudgel.".
In some embodiments, "( )" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when "( )" appears within a sentence and is followed by the character "里" (li, "inside"), "( )" is converted into "parentheses". Illustratively, the text to be converted "According to the rule, do you know what number should be filled in ( )？" is converted into "According to the rule, do you know what number should be filled in the parentheses？".
In some embodiments, "( )" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when "( )" appears at the end of a sentence and is preceded by a terminator, "( )" is converted to null and is not translated. A terminator indicates the end of a sentence and may be, for example, a period, a question mark, or an exclamation mark. Illustratively, the text to be converted "Please make a choice: which traffic sign below has no square？( )" is converted into "Please make a choice: which traffic sign below has no square？".
In some embodiments, "( )" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when "( )" appears at the end of a sentence and is preceded by an equals sign, "( )" is converted into "what". Illustratively, the text information to be converted "3 plus 5 = ( )" is converted into "3 plus 5 equals what".
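The equals-sign rule above can be sketched as a single substitution. This is an illustrative sketch of the rule's effect, not the patent's actual expression.

```python
import re

# "( )" preceded by "=" is the answer blank of an arithmetic question
# and is read as "equals what".
EQUALS_BLANK = re.compile(r"=\s*\(\s*\)")

def convert_equals_blank(text: str) -> str:
    return EQUALS_BLANK.sub(" equals what", text)
```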
In some embodiments, "( )" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when the words "judge the size" appear in the text to be converted, the characters before and after "( )" are numbers, and no other symbols and/or spaces lie between "( )" and the numbers, "( )" is converted into "greater than or less than". Illustratively, the text information to be converted "Judge the size: 20 minus 3 ( ) 16" is converted into "Judge the size: is 20 minus 3 greater than or less than 16".
In some embodiments, "( )" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when the characters on both sides of "( )" are numbers, "( )" is converted into "what"; in some examples, symbols may be included between "( )" and the numbers. Illustratively, the text information "2, 3, 7, 4, 4, 9, 6, 5, 11, ( ), 10, 7, 15" is converted into "2, 3, 7, 4, 4, 9, 6, 5, 11, what, 10, 7, 15".
In some embodiments, when converting "()" in the text information through the deep neural network, a [Mask] token may first replace "()"; the [Mask] token is a special mask symbol (shown only as an image in the original publication).
In some embodiments, the text information to be converted is classified by feeding it into a trained deep neural network. Illustratively, the BERT model may acquire, in parallel through an attention mechanism, word representations containing the context of the text to be converted; in some examples, the context may be extracted using an 8-head self-attention mechanism. Illustratively, the information obtained by the BERT model may be expressed as: Encoder_bert = BERT(text).
In some embodiments, the core of the deep neural network model may be a transformer-like encoder. For example, the class with the highest softmax probability may be selected as the classification result, and after the classification information is determined, the text to be converted is labeled with the corresponding tag. In some examples, the classification information includes 4 classes: time, orientation, quantity, and thing.
Illustratively, when the classification information is time, "()" is converted into "when"; when the classification information is orientation, "()" is converted into "which side"; when the classification information is quantity, "()" is converted into "how many". When the classification information is thing, it is judged whether the text to be converted contains a subject: when it does not, "()" is converted into "what"; when it does, it is further judged whether the subject is animate, converting "()" into "what" for an inanimate subject and into "who" for an animate subject. Finally, the mask is replaced with the conversion result.
In some embodiments, the deep neural network model may be trained using a loss function. Illustratively, the loss function may be the cross-entropy loss used in multi-classification tasks. In some embodiments, the loss function may satisfy the following equation:
Loss = -(1/N) Σ_i Σ_{c=1}^{M} y_ic · log(p_ic)
in the formula: m represents the number of categories; yic denotes a sign function, taking 1 if the true class of sample i equals c, otherwise 0; pic denotes the predicted probability that the observed sample i belongs to class c.
In steps S308 and S310, blank symbols in the text information to be converted are processed by deep learning because, in education scenes, most texts containing blank symbols are interrogative sentences, so a blank symbol is equivalent to an interrogative pronoun in the sentence. Modern Chinese has about 16 interrogative pronouns. Those asking about things, time, places, and quantity mainly include 8, such as who, what, where, and when; those asking about manner, character, and reason mainly include how and why. There are 4 main interrogative particles (the Chinese particles ma, ne, ba, and a) and about 10 interrogative adverbs (such as nandao, "could it be"). In this work we mainly target educational scenarios, and through analysis of the texts we found that the meanings of the blank symbols can be summarized into the following categories: which, what, how many, who, and how much.
Therefore, in this alternative embodiment, the input text is classified using a deep learning method. A concrete implementation is shown in Table 1, where the classification includes 4 categories: thing, time, orientation, and quantity:
[Table 1: example texts and their classification into the thing, time, orientation, and quantity categories; rendered as images in the original document]
In step S312, the text information to be converted is converted into complete text information according to the regular-matching result and the classification-based conversion result of the deep neural network.
In step S314, the complete text information is subjected to speech synthesis to generate audio information.
The embodiments of step S312 and step S314 are already explained in step S140 and step S150, and are not repeated herein.
In some embodiments, some symbols to be recognized in questions in the educational scene are not covered by the preset regular matching rules, and those symbols cannot be accurately converted by the regular matching rules alone. Therefore, in this embodiment, after the symbols to be recognized in the text information to be converted have been converted according to the regular matching rules, the remaining symbols that do not conform to the rules and were left unconverted are converted through the trained deep neural network model. This embodiment ensures that the symbols to be recognized in the text information to be converted are converted completely, further improving the accuracy of symbol conversion, and thereby further improving the integrity of the speech synthesis result and expanding the application range of the disclosure.
The present embodiment further provides a flowchart of a method 500 that combines regular matching rules with a deep neural network; the method implements the symbol conversion process of the foregoing embodiments and, as shown in fig. 5, includes:
step S501, inputting text information to be converted; the text information to be converted may include a symbol to be recognized.
Step S502, judging whether the text information to be converted includes a symbol to be recognized; in some embodiments, the symbol to be recognized may be "[symbol image 1]", "_", or "()"; when the text information to be converted does not include a symbol to be recognized, the audio information corresponding to the text information to be converted is output directly.
Step S503, when the text information to be converted includes a symbol to be recognized such as "[symbol image 1]", converting "[symbol image 1]" according to the acquired regular matching rule.
Step S504, after the question-mark symbol in the text information to be converted has been processed by regular matching, judging whether the text information to be converted still includes a symbol to be recognized, which may be "_" or "()"; when the text information to be converted does not include such a symbol, the audio information corresponding to the text information to be converted is output directly.
Step S505, converting "_" in the text information to be converted according to the acquired regular matching rule.
Step S506, determining whether the text information to be converted further includes "_", and when the text information to be converted does not include "_", skipping step S507 and executing step S508.
Step S507, when the text information to be converted still includes "_", converting, through the deep neural network model, the "_" occurrences that do not conform to the regular matching rule.
Step S508, judging whether the text information to be converted includes "()"; when the text information to be converted does not include "()", the text information to be converted is output directly.
Step S509, converts "()" in the text information to be converted according to the obtained regular matching rule.
Step S510, determining whether the text information to be converted further includes "()", and directly outputting the text information to be converted when the text information to be converted does not include "()".
And step S511, when the text information to be converted also comprises "()", converting the "()" which does not accord with the regular matching rule in the text information to be converted through the deep neural network model.
And S512, outputting the audio information corresponding to the converted text information.
The embodiment provides a method for combining the regular matching and the deep neural network, which can ensure that all the symbols to be recognized in the text information to be converted are converted, prevent the occurrence of the condition of missing the symbols to be recognized, and further improve the accuracy of speech synthesis.
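The flow of method 500 (regular matching first, with the deep network as a fallback for leftover symbols) can be sketched as follows; the rule list, the placeholder symbols, and the `model_predict` stand-in are assumptions, not the patent's actual rules or model:

```python
import re

def convert_symbols(text, regex_rules, model_predict):
    """Apply regular-matching rules first, then hand any remaining
    (unmatched) symbols to a model, as in the flow of Fig. 5.
    `model_predict(text, symbol)` stands in for the trained network."""
    for pattern, replacement in regex_rules:
        text = re.sub(pattern, replacement, text)
    while "()" in text or "_" in text:        # leftovers the rules missed
        symbol = "()" if "()" in text else "_"
        text = text.replace(symbol, model_predict(text, symbol), 1)
    return text

rules = [(r"=\s*\(\)", "= how much")]         # illustrative rule only
out = convert_symbols("1 + 1 = ()", rules, lambda t, s: "how much")
print(out)  # 1 + 1 = how much
```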
In this embodiment, a speech synthesis apparatus is further provided, and the speech synthesis apparatus is used to implement the foregoing embodiments and preferred embodiments, which have already been described and are not described again. As used hereinafter, the term "module" is a combination of software and/or hardware that can implement a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
The present embodiment provides a speech synthesis apparatus 600, as shown in fig. 6, including:
a first obtaining module 610, configured to obtain text information to be converted; the text information to be converted comprises a symbol to be identified;
a second obtaining module 620, configured to obtain a preset regular matching rule;
a first conversion module 630, configured to convert the symbol to be recognized into text information according to the preset regular matching rule;
the second conversion module 640 is configured to convert the text information to be converted into complete text information according to the text information corresponding to the symbol to be recognized;
and the speech synthesis module 650 is configured to perform speech synthesis on the complete text information to generate audio information.
Optionally, the first conversion module is further configured to:
when the symbol to be recognized includes "[symbol image 1]", convert "[symbol image 1]" into a question mark; or,
when the symbol to be recognized includes "[symbol image 1]", convert "[symbol image 1]" to empty.
Optionally, when the symbol to be recognized includes "_", the first conversion module is further configured to:
convert "_" to empty upon detecting that the portion containing "_" has the form "first part _ second part"; or,
convert "_" into "horizontal line" when the text information to be converted mentions the horizontal line.
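A hedged sketch of these two "_" rules; the regex and the literal phrase "horizontal line" are assumptions for illustration:

```python
import re

def convert_underscore(text):
    """Rules for "_" as described above: if the text mentions a horizontal
    line, read "_" aloud as "horizontal line"; if "_" joins two parts
    ("first part _ second part"), drop it (convert to empty)."""
    if "horizontal line" in text:
        return text.replace("_", "horizontal line")
    return re.sub(r"(\S+)\s*_\s*(\S+)", r"\1 \2", text)

print(convert_underscore("fill in the blank: 3 _ 5"))  # fill in the blank: 3 5
```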
Optionally, the first conversion module is further configured to:
when the symbol to be recognized includes "()", convert "()" into "parentheses"; or,
when the symbol to be recognized includes "[symbol image 2]", detect the position of "[symbol image 2]" in the text information to be converted; when the position indicates that "[symbol image 2]" is at the beginning of a sentence, convert "[symbol image 2]" into "who"; when the position indicates that "[symbol image 2]" is within a sentence and is followed by "in", convert "[symbol image 2]" into "parentheses"; when the position indicates that "[symbol image 2]" is at the end of a sentence and is preceded by a terminator, convert "[symbol image 2]" to empty; when the position indicates that "[symbol image 2]" is at the end of a sentence and is preceded by a number, convert "[symbol image 2]" into "how much"; or,
when the symbol to be recognized includes "[symbol image 2]", the text information to be converted includes "judge the size", and "[symbol image 2]" is preceded and followed by numbers, convert "[symbol image 2]" into "greater than or less than";
when the symbol to be recognized includes "[symbol image 2]" and "[symbol image 2]" is preceded and followed by numbers, convert "[symbol image 2]" into "why".
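These position-dependent rules can be sketched as follows; "()" stands in for the blank symbol (which appears as an image in the original), and all literal strings are illustrative:

```python
def convert_blank_by_position(text, symbol="()"):
    """Position-dependent conversion rules paraphrased from the text.
    Only the sentence-position rules are sketched; the "judge the size"
    and numbers-around cases are omitted for brevity."""
    i = text.find(symbol)
    if i < 0:
        return text
    before = text[:i].rstrip()
    after = text[i + len(symbol):].lstrip()
    if not before:                              # sentence-initial -> "who"
        word = "who"
    elif after.startswith("in"):                # followed by "in" -> "parentheses"
        word = "parentheses"
    elif not after or after[0] in ".?!":        # sentence-final
        if before.endswith((".", "?", "!")):    # preceded by a terminator -> empty
            word = ""
        elif before[-1:].isdigit():             # preceded by a number -> "how much"
            word = "how much"
        else:
            word = symbol                       # no rule matched; leave for the model
    else:
        word = symbol
    return text[:i] + word + text[i + len(symbol):]

print(convert_blank_by_position("() went to school"))  # who went to school
```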
Optionally, the symbol to be recognized includes a blank symbol, where the blank symbol includes "[symbol image 2]" and/or "_"; the apparatus further includes:
the classification module is used for inputting the text information to be converted into a trained deep neural network model and classifying the text information to be converted to obtain classification information; wherein the classification information includes: things, time, orientation, quantity;
the third conversion module is used for converting the symbol to be recognized into text information according to the classification information;
the trained deep neural network model is obtained by training in the following way:
obtaining sample text information and a classification label corresponding to the sample text information;
and training a deep neural network model by using the sample text information and the classification labels to obtain the trained deep neural network model.
Optionally, the classification module comprises:
the replacing unit is used for replacing the blank symbols in the text information to be converted with masks;
the acquiring unit is used for acquiring the context information of the text information to be converted;
and the classification unit is used for classifying the text information to be converted through an attention mechanism to obtain the classification information.
Optionally, the classification module is further configured to:
convert the blank symbol into "what time" when the classification information includes time;
convert the blank symbol into "which side" when the classification information includes orientation;
convert the blank symbol into "how many" when the classification information includes quantity.
Optionally, when the classification information includes thing, the third conversion module is further configured to:
detect whether the subject of the text information to be converted is animate; convert the blank symbol into "who" when the subject is animate; convert the blank symbol into "which" when the subject is inanimate; or,
convert the blank symbol into "why" when the text information to be converted does not contain a subject. The speech synthesis apparatus in this embodiment is presented in the form of functional units, where a unit refers to an ASIC circuit, a processor and memory executing one or more pieces of software or firmware, and/or other devices that can provide the above-described functionality.
The embodiments provided by the disclosure can perform complete and accurate speech synthesis on educational-scene questions. Combining the regular matching method with the deep learning method ensures the accuracy of the speech synthesis result, and introducing the attention transition mechanism makes long-text synthesis possible. With the speech synthesis method provided by the disclosure, a user can acquire all the information of a text without reading it, saving reading time. In educational applications, this can further improve learning efficiency.
The complete steps 700 of the exemplary embodiment of the present disclosure in the application of the educational scenario can refer to fig. 7, which includes:
step S701, inputting a question text.
Step S702, inputting the question type text into a text analysis module.
And step S703, predicting the symbol through the text analysis module, and predicting the meaning corresponding to the symbol in the question text.
Step S704, predicting the blank symbols in the question text, such as "_" and "[symbol image 2]".
Step S705, a complete semantic text is obtained according to the prediction result.
Step S706, the complete semantic text is input into a speech synthesis front-end module to perform word-sound conversion, so as to obtain a phoneme sequence.
In step S707, the front-end module inputs the phoneme sequence into the encoder, and encodes the output of the front-end module through the encoder.
In step S708, the attention module extracts information in the phoneme sequence in parallel.
In step S709, the attention transition module controls the attention to advance or stop at each time step of the encoder according to the output of the attention module.
In step S710, the decoder decodes the output of the encoder according to the output of the attention module.
In step S711, the decoder outputs a mel spectrum result.
In step S712, the mel spectrum output by the decoder is input into the vocoder module.
In step S713, the vocoder generates audio from the mel spectrum.
The embodiment provides an exemplary complete process of applying the speech synthesis method disclosed by the disclosure in an educational scene, and the steps described in the embodiment can realize a speech synthesis function for a topic-type text in the educational scene.
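Steps S706 through S713 describe a standard text-to-speech pipeline (phonemes, encoder, attention-guided decoder, mel spectrum, vocoder). A skeleton with dummy stand-in components, none of which is the patent's actual model, might look like:

```python
import numpy as np

class TTSPipeline:
    """Skeleton of steps S706-S713: text -> phonemes -> encoder ->
    decoder -> mel spectrum -> vocoder -> audio. Every component here
    is a toy placeholder for the corresponding module."""
    def text_to_phonemes(self, text):
        return list(text.replace(" ", ""))      # placeholder grapheme-to-phoneme

    def encode(self, phonemes):
        # fake phoneme embeddings: one 8-dim one-hot row per phoneme
        return np.eye(8)[[hash(p) % 8 for p in phonemes]]

    def decode_to_mel(self, states, n_mels=80):
        # one mel frame per encoder state; a real decoder is autoregressive
        return np.repeat(states, 10, axis=1)[:, :n_mels]

    def vocode(self, mel):
        return np.sin(np.cumsum(mel.sum(axis=1)))  # toy waveform

    def synthesize(self, text):
        mel = self.decode_to_mel(self.encode(self.text_to_phonemes(text)))
        return self.vocode(mel)

audio = TTSPipeline().synthesize("who went to school")
print(audio.shape)  # one sample per phoneme in this toy version
```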
An exemplary embodiment of the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor, the computer program, when executed by the at least one processor, is for causing the electronic device to perform a method according to an embodiment of the disclosure.
The disclosed exemplary embodiments also provide a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
The exemplary embodiments of the present disclosure also provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
Referring to fig. 8, a block diagram of an electronic device 800, which may be a server or a client of the present disclosure and is an example of a hardware device applicable to aspects of the present disclosure, will now be described. The term electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant as examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, an output unit 807, a storage unit 808, and a communication unit 809. The input unit 806 may be any type of device capable of inputting information to the electronic device 800; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device. The output unit 807 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 808 may include, but is not limited to, a magnetic disk or an optical disk. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth (TM) devices, WiFi devices, WiMax devices, cellular communication devices, and the like.
The computing unit 801 may be any of a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 801 executes the respective methods and processes described above. For example, in some embodiments, method 100 and/or method 200 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. In some embodiments, the computing unit 801 may be configured to perform the method 100 and/or the method 200 by any other suitable means (e.g., by means of firmware).
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Claims (11)

1. A method of speech synthesis comprising:
acquiring text information to be converted; the text information to be converted comprises a symbol to be identified;
acquiring a preset regular matching rule;
converting the symbol to be recognized into text information according to the preset regular matching rule;
converting the text information to be converted into complete text information according to the text information corresponding to the symbol to be recognized;
and carrying out voice synthesis on the complete text information to generate audio information.
2. The speech synthesis method according to claim 1, wherein converting the symbol to be recognized into text information according to the preset regular matching rule comprises:
when the symbol to be recognized includes "[symbol image 1]", converting "[symbol image 1]" into a question mark; or,
when the symbol to be recognized includes "[symbol image 1]", converting "[symbol image 1]" to empty.
3. The speech synthesis method according to claim 1, wherein, when the symbol to be recognized includes "_", converting the symbol to be recognized into text information according to the preset regular matching rule includes:
converting "_" to empty upon detecting that the portion containing "_" has the form "first part _ second part"; or,
converting "_" into "horizontal line" when the text information to be converted mentions the horizontal line.
4. The speech synthesis method according to claim 1, wherein converting the symbol to be recognized into text information according to the preset regular matching rule comprises:
when the symbol to be recognized includes "()", converting "()" into "parentheses"; or,
when the symbol to be recognized includes "[symbol image 2]", detecting the position of "[symbol image 2]" in the text information to be converted; when the position indicates that "[symbol image 2]" is at the beginning of a sentence, converting "[symbol image 2]" into "who"; when the position indicates that "[symbol image 2]" is within a sentence and is followed by "in", converting "[symbol image 2]" into "parentheses"; when the position indicates that "[symbol image 2]" is at the end of a sentence and is preceded by a terminator, converting "[symbol image 2]" to empty; when the position indicates that "[symbol image 2]" is at the end of a sentence and is preceded by a number, converting "[symbol image 2]" into "how much"; or,
when the symbol to be recognized includes "[symbol image 2]", the text information to be converted includes "judge the size", and "[symbol image 2]" is preceded and followed by numbers, converting "[symbol image 2]" into "greater than or less than";
when the symbol to be recognized includes "[symbol image 2]" and "[symbol image 2]" is preceded and followed by numbers, converting "[symbol image 2]" into "why".
5. The speech synthesis method according to claim 1, wherein the symbol to be recognized comprises a blank symbol; wherein the blank symbol comprises "[symbol image 2]" and/or "_", the method further comprising:
inputting the text information to be converted into a trained deep neural network model, and classifying the text information to be converted to obtain classified information; wherein the classification information includes: things, time, orientation, quantity;
converting the symbol to be recognized into text information according to the classification information;
the trained deep neural network model is obtained by training in the following way:
obtaining sample text information and a classification label corresponding to the sample text information;
and training a deep neural network model by using the sample text information and the classification labels to obtain the trained deep neural network model.
6. The speech synthesis method according to claim 5, wherein classifying the text information to be converted comprises:
replacing the blank symbols in the text information to be converted with masks;
acquiring context information of the text information to be converted;
and classifying the text information to be converted through an attention mechanism to obtain the classification information.
7. The speech synthesis method according to claim 5, wherein converting the symbol to be recognized into text information according to the classification information comprises:
converting the blank symbol into "what time" when the classification information includes time;
converting the blank symbol into "which side" when the classification information includes orientation;
converting the blank symbol into "how many" when the classification information includes quantity.
8. The speech synthesis method according to claim 5, wherein, when the classification information includes things, converting the symbol to be recognized into text information according to the classification information includes:
detecting whether the subject of the text information to be converted is animate; converting the blank symbol into "who" when the subject is animate; converting the blank symbol into "which" when the subject is inanimate; or,
converting the blank symbol into "why" when the text information to be converted does not contain a subject.
9. A speech synthesis apparatus comprising:
the first acquisition module is used for acquiring text information to be converted; the text information to be converted comprises a symbol to be identified;
the second acquisition module is used for acquiring a preset regular matching rule;
the first conversion module is used for converting the symbol to be recognized into text information according to the preset regular matching rule;
the second conversion module is used for converting the text information to be converted into complete text information according to the text information corresponding to the symbol to be recognized;
and the voice synthesis module is used for carrying out voice synthesis on the complete text information to generate audio information.
10. An electronic device, comprising:
a processor; and
a memory for storing a program, wherein the program is stored in the memory,
wherein the program comprises instructions which, when executed by the processor, cause the processor to carry out the method according to any one of claims 1-8.
11. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.
CN202110893747.4A 2021-08-05 2021-08-05 Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium Active CN113345409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110893747.4A CN113345409B (en) 2021-08-05 2021-08-05 Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN113345409A true CN113345409A (en) 2021-09-03
CN113345409B CN113345409B (en) 2021-11-26

Family

ID=77480695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110893747.4A Active CN113345409B (en) 2021-08-05 2021-08-05 Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN113345409B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103761963A (en) * 2014-02-18 2014-04-30 大陆汽车投资(上海)有限公司 Method for processing text containing emotion information
CN107729310A (en) * 2016-08-11 2018-02-23 中兴通讯股份有限公司 A kind of extracting method of text message, device and mobile terminal
JP2018169434A (en) * 2017-03-29 2018-11-01 富士通株式会社 Voice synthesizer, voice synthesis method, voice synthesis system and computer program for voice synthesis
CN109326279A (en) * 2018-11-23 2019-02-12 北京羽扇智信息科技有限公司 A kind of method, apparatus of text-to-speech, electronic equipment and storage medium
CN109918658A (en) * 2019-02-28 2019-06-21 云孚科技(北京)有限公司 A kind of method and system obtaining target vocabulary from text
CN111104517A (en) * 2019-10-01 2020-05-05 浙江工商大学 Chinese problem generation method based on two triplets
CN112633004A (en) * 2020-11-04 2021-04-09 北京字跳网络技术有限公司 Text punctuation deletion method and device, electronic equipment and storage medium
CN112687258A (en) * 2021-03-11 2021-04-20 北京世纪好未来教育科技有限公司 Speech synthesis method, apparatus and computer storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113724686A (en) * 2021-11-03 2021-11-30 中国科学院自动化研究所 Method and device for editing audio, electronic equipment and storage medium
US11462207B1 (en) 2021-11-03 2022-10-04 Institute Of Automation, Chinese Academy Of Sciences Method and apparatus for editing audio, electronic device and storage medium

Also Published As

Publication number Publication date
CN113345409B (en) 2021-11-26

Similar Documents

Publication Publication Date Title
CN111625634B (en) Word slot recognition method and device, computer readable storage medium and electronic equipment
CN109660865B (en) Method and device for automatically labeling videos, medium and electronic equipment
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN111310447A (en) Grammar error correction method, grammar error correction device, electronic equipment and storage medium
CN111027291B (en) Method and device for adding mark symbols in text and method and device for training model, and electronic equipment
CN114021582B (en) Spoken language understanding method, device, equipment and storage medium combined with voice information
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN113345409B (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
US20230034414A1 (en) Dialogue processing apparatus, learning apparatus, dialogue processing method, learning method and program
CN113743101B (en) Text error correction method, apparatus, electronic device and computer storage medium
CN113918031A (en) System and method for Chinese punctuation recovery using sub-character information
CN113761883A (en) Text information identification method and device, electronic equipment and storage medium
CN110516125B (en) Method, device and equipment for identifying abnormal character string and readable storage medium
US20230153550A1 (en) Machine Translation Method and Apparatus, Device and Storage Medium
CN116595023A (en) Address information updating method and device, electronic equipment and storage medium
CN111126059A (en) Method and device for generating short text and readable storage medium
CN114611529B (en) Intention recognition method and device, electronic equipment and storage medium
CN116432705A (en) Text generation model construction method, text generation device, equipment and medium
CN113723367B (en) Answer determining method, question judging method and device and electronic equipment
CN115631502A (en) Character recognition method, character recognition device, model training method, electronic device and medium
CN113342981A (en) Demand document classification method and device based on machine learning
CN113420121A (en) Text processing model training method, voice text processing method and device
CN111160042B (en) Text semantic analysis method and device
CN113722496B (en) Triple extraction method and device, readable storage medium and electronic equipment
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant