CN113345409A - Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium - Google Patents


Info

Publication number: CN113345409A (granted as CN113345409B)
Application number: CN202110893747.4A
Authority: CN (China)
Prior art keywords: text information, converted, symbol, recognized, converting
Inventors: 智鹏鹏, 陈帅婷, 陈昌滨
Applicant/Assignee: Beijing Century TAL Education Technology Co Ltd
Other languages: Chinese (zh)
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L2013/083: Special characters, e.g. punctuation marks

Abstract

The present disclosure provides a speech synthesis method, apparatus, electronic device, and computer-readable storage medium. The method comprises: acquiring text information to be converted, the text information to be converted comprising a symbol to be recognized; acquiring a preset regular matching rule; converting the symbol to be recognized into text information according to the preset regular matching rule; converting the text information to be converted into complete text information according to the text information corresponding to the symbol to be recognized; and performing speech synthesis on the complete text information to generate audio information.

Description

Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
Technical Field
The present invention relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method and apparatus, an electronic device, and a computer-readable storage medium.
Background
With the application of deep learning, speech synthesis technology is developing rapidly. Speech synthesis identifies the content of a text and converts the text content to be synthesized into speech; end-to-end speech synthesis can produce speech of high naturalness that can be applied to many scenarios and expressed clearly and completely in them. Users can listen directly to the information they need on various occasions without reading the text.
However, conventional speech synthesis technology provides no speech synthesis scheme for the question types found in education scenarios. For example, the text of a question to be synthesized may be interspersed with blank-information symbols such as "_", "()", or "？" (the last rendered only as an image in the original publication), and directly feeding text containing such symbols into speech synthesis yields a result that cannot completely express the original question.
Disclosure of Invention
According to an aspect of the present disclosure, there is provided a speech synthesis method including:
acquiring text information to be converted; the text information to be converted comprises a symbol to be identified;
acquiring a preset regular matching rule;
converting the symbol to be recognized into text information according to the preset regular matching rule;
converting the text information to be converted into complete text information according to the text information corresponding to the symbol to be recognized;
and carrying out voice synthesis on the complete text information to generate audio information.
According to another aspect of the present disclosure, there is provided a speech synthesis apparatus including:
the first acquisition module is used for acquiring the text information to be converted, wherein the text information to be converted comprises a symbol to be identified;
the second acquisition module is used for acquiring the preset regular matching rule;
the first conversion module is used for converting the symbol to be recognized into text information according to the preset regular matching rule;
the second conversion module is used for converting the text information to be converted into complete text information according to the text information corresponding to the symbol to be recognized; and the number of the first and second groups,
and the voice synthesis module is used for carrying out voice synthesis on the complete text information to generate audio information.
According to another aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory for storing a program, wherein the program is stored in the memory,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the speech synthesis method of any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the speech synthesis method of any one of the embodiments of the present disclosure.
By means of this speech synthesis method, speech synthesis of education-scenario questions can be achieved, and the accuracy and completeness of the meaning of the generated audio are guaranteed.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
Further details, features and advantages of the disclosure are disclosed in the following description of exemplary embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 illustrates a flow diagram of a method of speech synthesis according to some embodiments of the present disclosure;
FIG. 2 shows a schematic diagram of an encoder and decoder architecture of the present disclosure;
FIG. 3 shows a flow diagram of a method of speech synthesis according to further embodiments of the present disclosure;
FIG. 4 is a flowchart illustrating determining and converting output of symbols in text information to be synthesized according to an exemplary embodiment of the present disclosure;
FIG. 5 illustrates a flow diagram of a symbol conversion process in accordance with some embodiments of the present disclosure;
FIG. 6 illustrates a schematic structural diagram of a speech synthesis apparatus according to some embodiments of the present disclosure;
FIG. 7 illustrates a flow diagram of a method of speech synthesis according to some embodiments of the present disclosure;
FIG. 8 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description. It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
End-to-end speech synthesis can synthesize speech of high naturalness and is applied in many scenarios. At present, however, in education scenarios, the text to undergo speech synthesis is interspersed with blank-information symbols such as "()", "_", or "？". Directly feeding such text into synthesis causes omitted information, gaps, incomplete content, semantic errors, and similar problems, so the original question cannot be expressed completely. Manually escaping the text to be synthesized costs considerable labor and time, defeating the labor-saving purpose of speech synthesis technology.
This embodiment provides a speech synthesis method that may be used in a smart device such as a mobile phone or a tablet computer. FIG. 1 shows a flowchart of a speech synthesis method 100 according to an exemplary embodiment of the present disclosure; as shown in FIG. 1, the speech synthesis method 100 includes the following steps:
in step S110, text information to be converted is acquired.
The text information to be converted may be acquired, for example, by photographing the text with a camera device and importing the photograph, or by directly inputting an electronic version of the text information to be converted into the client.
In some embodiments, the text information to be converted may be a Chinese question from an education scenario. Illustratively, the Chinese question types of the education scenario may be multiple-choice questions, fill-in-the-blank questions, short-answer questions, and the like. In some examples, the Chinese question text of the education scenario contains a symbol to be recognized, i.e., a symbol that is difficult for existing speech synthesis technology to synthesize accurately; for example, the symbol to be recognized may be "？", "_", "()", and so on. For example, suppose the text information to be converted is "Children, carefully observe the rule of the numbers; the number at ？ is ( )". With existing speech synthesis technology, the symbols are read incorrectly and the question loses its meaning, whereas the correct synthesis result of this text is "Children, carefully observe the rule of the numbers; what is the number at the question mark?".
In some embodiments, it may first be judged whether the text information to be converted contains "？" or blank symbols such as "()" and "_". When the text to be converted contains such symbols, step S120 is executed; when it does not, step S120 and step S130 may be skipped and step S140 executed directly. A blank symbol is a symbol used to express specific semantics in the text to be converted; illustratively, in an education scenario, correct information can be filled in at the position of the blank symbol so that the meaning expressed by the text to be converted is complete.
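The check described above can be sketched as a simple scan for the blank symbols. This is a minimal illustrative sketch, not the patent's implementation; "？" is used as a stand-in for the blank mark that appears only as an image in the original publication.

```python
import re

# Assumed blank-symbol inventory: "_", "()" (possibly with spaces), and
# the "？" placeholder mark. These are illustrative stand-ins.
BLANK_SYMBOL = re.compile(r"[_？]|\(\s*\)")

def contains_blank_symbol(text: str) -> bool:
    """True if the text to be converted contains a symbol to be recognized."""
    return BLANK_SYMBOL.search(text) is not None
```

When this check returns False, the conversion steps can be skipped and the text sent directly to synthesis.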
In step S120, a preset regular matching rule is acquired.
The regular matching rule can convert the symbol to be recognized according to the position of the symbol to be recognized in the text information to be converted. In some embodiments, the regular matching rule may also convert the symbol to be recognized according to the characters in the text information to be converted.
In some embodiments, the regular matching rule may be input by the user according to the application scenario, in addition to being preset in advance.
In step S130, the symbol to be recognized is converted into text information according to a preset regular matching rule.
In some embodiments, the symbol to be recognized in the text information to be converted is converted according to a preset regular matching rule. Illustratively, the matched symbol to be recognized in the text to be converted can be replaced by the result corresponding to the regular matching rule.
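Step S130 can be sketched as an ordered table of pattern/replacement pairs, where each matched symbol is replaced by the result the rule maps it to. The two rules below are illustrative stand-ins (again using "？" for the image-only blank mark), not the patent's actual rule set.

```python
import re

# Illustrative rule table: a "？" quoted alone is read out as the words
# "question mark"; a lone "？" elsewhere is dropped. Order matters.
RULES = [
    (re.compile(r'"？"'), "question mark"),
    (re.compile(r"？"), ""),
]

def convert_symbols(text: str) -> str:
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Each rule is applied over the whole text before the next, mirroring the sequential matching passes of the method.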
In step S140, the text information to be converted is converted into complete text information according to the text information corresponding to the symbol to be recognized.
The complete text information does not contain the symbol to be recognized, and can be used for performing speech synthesis with complete and accurate meaning.
In step S150, the complete text information is speech-synthesized to generate audio information.
For example, the complete text information may be converted into phonetic notation and then input into an encoder; a decoder decodes the output of the encoder, and the decoder's generated result is converted into audio output.
In some embodiments, grapheme-to-phoneme conversion may be performed on the complete text information to generate a phoneme sequence. The phoneme sequence is input into an encoder, which may use 3 one-dimensional convolutional layers (kernel size 5, 512 units) and 1 bidirectional Long Short-Term Memory (BLSTM) layer of 256 units; the character embedding and the reference embedding output by a reference encoder are added and fed into the BLSTM to generate intermediate hidden variables.
In some embodiments, the decoder decodes the encoder output to obtain a mel spectrogram. In some examples, the decoder may be divided into a pre-net, an Attention Recurrent Neural Network (Attention-RNN), and a Decoder Recurrent Neural Network (Decoder-RNN). Illustratively, the Decoder-RNN is a two-layer residual Gated Recurrent Unit network (residual GRU) containing 256 GRUs per layer, and the output of the Decoder-RNN is the sum of the input and the output passing through the GRU units. Illustratively, the encoder and decoder may form a sequence-to-sequence (seq2seq) architecture; the encoder-decoder architecture of this embodiment is shown in FIG. 2.
In some embodiments, the decoder decodes the encoder output through an attention mechanism. Illustratively, the attention structure may be one RNN layer comprising 128 GRUs, whose inputs are the outputs of the pre-net and the Attention-RNN. Illustratively, the attention mechanism uses location-sensitive attention to obtain alignment features.
In some embodiments, the attention transition mechanism recursively calculates a modified attention probability for each time step using a forward algorithm, allowing the attention mechanism to make a move-forward or dwell decision at each decoder time step.
Illustratively, the spectrogram generated by the decoder may be converted to audio via the Griffin-Lim algorithm or a neural vocoder.
In this embodiment, a text analysis step is introduced before the speech synthesis front end: complete text information is obtained by judging the symbols to be recognized in the text to be converted and translating them through regular-expression matching. Performing speech synthesis on the converted complete text greatly improves synthesis quality, ensures the accuracy of speech synthesis for Chinese question types in education scenarios in particular, guarantees the completeness of the meaning of the synthesis result, broadens the application range of speech synthesis technology, saves users the time of reading text to obtain information, and reduces the cost of manually annotating Chinese question types in education scenarios for speech synthesis.
FIG. 3 illustrates a flow diagram of a method 300 of speech synthesis according to further embodiments of the present disclosure. As shown in fig. 3, the method 300 includes the steps of:
in step S302, text information to be converted is acquired.
In step S304, a preset regular matching rule is acquired.
The embodiments of step S302 and step S304 are already explained in step S110 and step S120, and are not repeated herein.
In step S306, "？" in the text information to be converted is matched according to the preset regular matching rule. The implementation of this embodiment is illustrated in FIG. 4. Before regular matching, it is detected whether the text to be converted includes a symbol to be recognized such as "？"; when no symbol to be recognized is included, steps S306, S308, and S310 may be skipped and step S312 performed directly.
In some embodiments, "？" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when "？" stands in the middle of a pair of double quotation marks, and no characters and/or symbols other than "？" lie between the quotation marks, "？" is converted into the words "question mark". Illustratively, when "？" appears alone in the text to be converted, it is converted to null; a "？" appearing alone is not translated. Illustratively, the text to be converted "Can you find the rule of the figure change？ According to the rule, what should "？" be ( )" is converted into "Can you find the rule of the figure change; according to the rule, what should the question mark be".
In step S308, "_" in the text information to be converted is matched according to the preset regular matching rule. After matching is completed, it is checked whether any "_" remains in the text to be converted; if an unmatched "_" exists, the text to be converted is classified by the deep neural network and the unmatched "_" is converted.
In some embodiments, "_" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when the underscore serves as a connector symbol in the text to be converted, "_" is converted to null; a connector "_" is not translated. In some examples, the specific form of "_" as a connector may be "first part _ second part", where neither part is a symbol or a space; for example, "A_1", "A_2", and so on. Illustratively, the text information to be converted "four factories A_1, A_2, A_3, A_4 on one highway" is converted into "four factories A1, A2, A3, A4 on one highway".
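The connector rule above can be sketched with a lookbehind/lookahead pattern: an underscore directly joining two alphanumeric characters is treated as a connector and dropped rather than read aloud. This is an illustrative sketch, not the patent's actual expression.

```python
import re

# Connector "_": directly flanked by alphanumerics on both sides
# (e.g. "A_1"). Underscores next to spaces or punctuation are left
# alone so that genuine blank marks survive for later passes.
CONNECTOR = re.compile(r"(?<=[A-Za-z0-9])_(?=[A-Za-z0-9])")

def drop_connectors(text: str) -> str:
    return CONNECTOR.sub("", text)
```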
In some embodiments, "_" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when the characters "horizontal line" are included in the text to be converted, "_" is converted into the words "horizontal line". Illustratively, the text to be converted "Fill the appropriate number on the horizontal line below according to the rule. 1, 5, 9, 13, _, 21, 25" is converted into "Fill the appropriate number on the horizontal line below according to the rule. 1, 5, 9, 13, horizontal line, 21, 25".
In some embodiments, when converting "_" in the text information through the deep neural network, a [Mask] token may first replace "_", and the mask is later replaced with the conversion result. The [Mask] token is a special mask symbol (shown only as an image in the original publication).
In some embodiments, the text information to be converted may be fed into a trained deep neural network and classified. Illustratively, a Bidirectional Encoder Representations from Transformers (BERT) model may acquire, in parallel through an attention mechanism, word representations containing the context of the text to be converted; in some examples, the context information may be extracted using an 8-head self-attention mechanism. Illustratively, the information obtained by the BERT model may be expressed as: Encoder_bert = BERT(text).
In some embodiments, the core of the deep neural network model may be a transformer-like encoder. For example, the class with the highest softmax probability may be selected as the classification result, and after the classification information is determined, the text to be converted is labeled with the corresponding tag. In some examples, the classification information may include 4 classes: time, orientation, quantity, and thing.
Illustratively, when the classification information is time, "_" is converted into "when"; when the classification information is orientation, "_" is converted into "which side"; when the classification information is quantity, "_" is converted into "how many". When the classification information is thing, it is judged whether the text to be converted contains a subject: when it does not, "_" is converted into "what"; when it does, it is further judged whether the subject is animate, converting "_" into "what" for an inanimate subject and into "who" for an animate subject. Finally, the mask is replaced with the conversion result.
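The class-to-question-word mapping above can be sketched as follows. The English question words are glosses of the translated text (the real system substitutes Chinese interrogative pronouns), and the subject/animacy checks are simplified to boolean flags.

```python
# Map a predicted blank-symbol category to the question word that
# replaces the blank. Category names follow the four classes in the
# text: time, orientation, quantity, thing.
def blank_replacement(category: str, has_subject: bool = True,
                      animate_subject: bool = False) -> str:
    if category == "time":
        return "when"
    if category == "orientation":
        return "which side"
    if category == "quantity":
        return "how many"
    if category == "thing":
        # No subject or an inanimate subject -> "what"; animate -> "who".
        return "who" if has_subject and animate_subject else "what"
    raise ValueError(f"unknown category: {category}")
```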
In step S310, "()" in the text information to be converted is matched according to the preset regular matching rule. After matching is completed, it is checked whether any "()" remains in the text to be converted; if an unmatched "()" exists, the text to be converted is classified by the deep neural network and the unmatched "()" is converted.
In some embodiments, "()" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when the symbol to be recognized includes "()" with no space between the left and right parentheses, "()" is converted into the word "parentheses".
In some embodiments, "( )" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when "( )" appears at the beginning of a sentence with a blank space between the parentheses, "( )" is converted into "who". Illustratively, the text to be converted "( )'s two ends can be extended indefinitely, much like a golden cudgel." is converted into "Whose two ends can be extended indefinitely, much like a golden cudgel.".
In some embodiments, "( )" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when "( )" appears within a sentence and is followed by the character "里" (li, "inside"), "( )" is converted into "parentheses". Illustratively, the text to be converted "According to the rule, do you know what number should be filled in ( )？" is converted into "According to the rule, do you know what number should be filled in the parentheses？".
In some embodiments, "( )" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when "( )" appears at the end of a sentence and is preceded by a terminator, "( )" is converted to null and is not translated. A terminator indicates the end of a sentence and may be, for example, a period, a question mark, or an exclamation mark. Illustratively, the text to be converted "Please make a choice: which traffic sign below has no square？( )" is converted into "Please make a choice: which traffic sign below has no square？".
In some embodiments, "( )" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when "( )" appears at the end of a sentence and is preceded by an equals sign, "( )" is converted into "what". Illustratively, the text information to be converted "3 plus 5 = ( )" is converted into "3 plus 5 equals what".
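The equals-sign rule above can be sketched as a single substitution. This is an illustrative sketch of the rule's effect, not the patent's actual expression.

```python
import re

# "( )" preceded by "=" is the answer blank of an arithmetic question
# and is read as "equals what".
EQUALS_BLANK = re.compile(r"=\s*\(\s*\)")

def convert_equals_blank(text: str) -> str:
    return EQUALS_BLANK.sub(" equals what", text)
```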
In some embodiments, "( )" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when the words "judge the size" appear in the text to be converted, the characters before and after "( )" are numbers, and no other symbols and/or spaces lie between "( )" and the numbers, "( )" is converted into "greater than or less than". Illustratively, the text information to be converted "Judge the size: 20 minus 3 ( ) 16" is converted into "Judge the size: is 20 minus 3 greater than or less than 16".
In some embodiments, "( )" in the text information to be converted is converted according to the preset regular matching rule. Illustratively, when the characters on both sides of "( )" are numbers, "( )" is converted into "what"; in some examples, symbols may be included between "( )" and the numbers. Illustratively, the text information "2, 3, 7, 4, 4, 9, 6, 5, 11, ( ), 10, 7, 15" is converted into "2, 3, 7, 4, 4, 9, 6, 5, 11, what, 10, 7, 15".
In some embodiments, when converting "()" in the text information through the deep neural network, a [Mask] token may first replace "()"; the [Mask] token is a special mask symbol (shown only as an image in the original publication).
In some embodiments, the text information to be converted is classified by feeding it into a trained deep neural network. Illustratively, the BERT model may acquire, in parallel through an attention mechanism, word representations containing the context of the text to be converted; in some examples, the context may be extracted using an 8-head self-attention mechanism. Illustratively, the information obtained by the BERT model may be expressed as: Encoder_bert = BERT(text).
In some embodiments, the core of the deep neural network model may be a transformer-like encoder. For example, the class with the highest softmax probability may be selected as the classification result, and after the classification information is determined, the text to be converted is labeled with the corresponding tag. In some examples, the classification information includes 4 classes: time, orientation, quantity, and thing.
Illustratively, when the classification information is time, "()" is converted into "when"; when the classification information is orientation, "()" is converted into "which side"; when the classification information is quantity, "()" is converted into "how many". When the classification information is thing, it is judged whether the text to be converted contains a subject: when it does not, "()" is converted into "what"; when it does, it is further judged whether the subject is animate, converting "()" into "what" for an inanimate subject and into "who" for an animate subject. Finally, the mask is replaced with the conversion result.
In some embodiments, the deep neural network model may be trained using a loss function. Illustratively, the loss function may be the cross-entropy loss used in multi-classification tasks. In some embodiments, the loss function may satisfy the following equation:
Loss = -(1/N) Σ_i Σ_{c=1}^{M} y_ic · log(p_ic)
in the formula: m represents the number of categories; yic denotes a sign function, taking 1 if the true class of sample i equals c, otherwise 0; pic denotes the predicted probability that the observed sample i belongs to class c.
In steps S308 and S310, blank symbols in the text information to be converted are processed by deep learning because, in education scenes, most texts containing blank symbols are interrogative sentences, so a blank symbol is equivalent to an interrogative pronoun in the sentence. Modern Chinese has about 16 interrogative pronouns. Those asking about things, time, places, and quantity mainly include 8, such as who, what, where, and when; those asking about manner, character, and reason mainly include how and why. There are 4 main interrogative particles (the Chinese particles ma, ne, ba, and a) and about 10 interrogative adverbs (such as nandao, "could it be"). In this work we mainly target educational scenarios, and through analysis of the texts we found that the meanings of the blank symbols can be summarized into the following categories: which, what, how many, who, and how much.
Therefore, in this alternative embodiment, the input text is classified using a deep learning method. A concrete implementation is shown in Table 1, where the classification includes 4 categories: thing, time, orientation, and quantity:
[Table 1: example texts and their classification into the thing, time, orientation, and quantity categories; rendered as images in the original document]
In step S312, the text information to be converted is converted into complete text information according to the regular-matching result and the classification-based conversion result of the deep neural network.
In step S314, the complete text information is subjected to speech synthesis to generate audio information.
The embodiments of step S312 and step S314 are already explained in step S140 and step S150, and are not repeated herein.
In some embodiments, some symbols to be recognized in questions in the educational scene are not covered by the preset regular matching rules, and those symbols cannot be accurately converted by the regular matching rules alone. Therefore, in this embodiment, after the symbols to be recognized in the text information to be converted have been converted according to the regular matching rules, the remaining symbols that do not conform to the rules and were left unconverted are converted through the trained deep neural network model. This embodiment ensures that the symbols to be recognized in the text information to be converted are converted completely, further improving the accuracy of symbol conversion, and thereby further improving the integrity of the speech synthesis result and expanding the application range of the disclosure.
The present embodiment further provides a flowchart of a method 500 that combines regular matching rules with a deep neural network; the method implements the symbol conversion process of the foregoing embodiments and, as shown in fig. 5, includes:
step S501, inputting text information to be converted; the text information to be converted may include a symbol to be recognized.
Step S502, judging whether the text information to be converted includes a symbol to be recognized; in some embodiments, the symbol to be recognized may be "[symbol image 1]", "_", or "()"; when the text information to be converted does not include a symbol to be recognized, the audio information corresponding to the text information to be converted is output directly.
Step S503, when the text information to be converted includes a symbol to be recognized such as "[symbol image 1]", converting "[symbol image 1]" according to the acquired regular matching rule.
Step S504, after the question-mark symbol in the text information to be converted has been processed by regular matching, judging whether the text information to be converted still includes a symbol to be recognized, which may be "_" or "()"; when the text information to be converted does not include such a symbol, the audio information corresponding to the text information to be converted is output directly.
Step S505, converting "_" in the text information to be converted according to the acquired regular matching rule.
Step S506, determining whether the text information to be converted further includes "_", and when the text information to be converted does not include "_", skipping step S507 and executing step S508.
Step S507, when the text information to be converted still includes "_", converting, through the deep neural network model, the "_" occurrences that do not conform to the regular matching rule.
Step S508, judging whether the text information to be converted includes "()"; when the text information to be converted does not include "()", the text information to be converted is output directly.
Step S509, converts "()" in the text information to be converted according to the obtained regular matching rule.
Step S510, determining whether the text information to be converted further includes "()", and directly outputting the text information to be converted when the text information to be converted does not include "()".
And step S511, when the text information to be converted also comprises "()", converting the "()" which does not accord with the regular matching rule in the text information to be converted through the deep neural network model.
And S512, outputting the audio information corresponding to the converted text information.
The embodiment provides a method for combining the regular matching and the deep neural network, which can ensure that all the symbols to be recognized in the text information to be converted are converted, prevent the occurrence of the condition of missing the symbols to be recognized, and further improve the accuracy of speech synthesis.
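The flow of method 500 (regular matching first, with the deep network as a fallback for leftover symbols) can be sketched as follows; the rule list, the placeholder symbols, and the `model_predict` stand-in are assumptions, not the patent's actual rules or model:

```python
import re

def convert_symbols(text, regex_rules, model_predict):
    """Apply regular-matching rules first, then hand any remaining
    (unmatched) symbols to a model, as in the flow of Fig. 5.
    `model_predict(text, symbol)` stands in for the trained network."""
    for pattern, replacement in regex_rules:
        text = re.sub(pattern, replacement, text)
    while "()" in text or "_" in text:        # leftovers the rules missed
        symbol = "()" if "()" in text else "_"
        text = text.replace(symbol, model_predict(text, symbol), 1)
    return text

rules = [(r"=\s*\(\)", "= how much")]         # illustrative rule only
out = convert_symbols("1 + 1 = ()", rules, lambda t, s: "how much")
print(out)  # 1 + 1 = how much
```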
In this embodiment, a speech synthesis apparatus is further provided, and the speech synthesis apparatus is used to implement the foregoing embodiments and preferred embodiments, which have already been described and are not described again. As used hereinafter, the term "module" is a combination of software and/or hardware that can implement a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
The present embodiment provides a speech synthesis apparatus 600, as shown in fig. 6, including:
a first obtaining module 610, configured to obtain text information to be converted; the text information to be converted comprises a symbol to be identified;
a second obtaining module 620, configured to obtain a preset regular matching rule;
a first conversion module 630, configured to convert the symbol to be recognized into text information according to the preset regular matching rule;
the second conversion module 640 is configured to convert the text information to be converted into complete text information according to the text information corresponding to the symbol to be recognized;
and the speech synthesis module 650 is configured to perform speech synthesis on the complete text information to generate audio information.
Optionally, the first conversion module is further configured to:
when the symbol to be recognized includes "[symbol image 1]", convert "[symbol image 1]" into a question mark; or,
when the symbol to be recognized includes "[symbol image 1]", convert "[symbol image 1]" to empty.
Optionally, when the symbol to be recognized includes "_", the first conversion module is further configured to:
convert "_" to empty upon detecting that the portion containing "_" has the form "first part _ second part"; or,
convert "_" into "horizontal line" when the text information to be converted mentions the horizontal line.
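A hedged sketch of these two "_" rules; the regex and the literal phrase "horizontal line" are assumptions for illustration:

```python
import re

def convert_underscore(text):
    """Rules for "_" as described above: if the text mentions a horizontal
    line, read "_" aloud as "horizontal line"; if "_" joins two parts
    ("first part _ second part"), drop it (convert to empty)."""
    if "horizontal line" in text:
        return text.replace("_", "horizontal line")
    return re.sub(r"(\S+)\s*_\s*(\S+)", r"\1 \2", text)

print(convert_underscore("fill in the blank: 3 _ 5"))  # fill in the blank: 3 5
```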
Optionally, the first conversion module is further configured to:
when the symbol to be recognized includes "()", convert "()" into "parentheses"; or,
when the symbol to be recognized includes "[symbol image 2]", detect the position of "[symbol image 2]" in the text information to be converted; when the position indicates that "[symbol image 2]" is at the beginning of a sentence, convert "[symbol image 2]" into "who"; when the position indicates that "[symbol image 2]" is within a sentence and is followed by "in", convert "[symbol image 2]" into "parentheses"; when the position indicates that "[symbol image 2]" is at the end of a sentence and is preceded by a terminator, convert "[symbol image 2]" to empty; when the position indicates that "[symbol image 2]" is at the end of a sentence and is preceded by a number, convert "[symbol image 2]" into "how much"; or,
when the symbol to be recognized includes "[symbol image 2]", the text information to be converted includes "judge the size", and "[symbol image 2]" is preceded and followed by numbers, convert "[symbol image 2]" into "greater than or less than";
when the symbol to be recognized includes "[symbol image 2]" and "[symbol image 2]" is preceded and followed by numbers, convert "[symbol image 2]" into "why".
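These position-dependent rules can be sketched as follows; "()" stands in for the blank symbol (which appears as an image in the original), and all literal strings are illustrative:

```python
def convert_blank_by_position(text, symbol="()"):
    """Position-dependent conversion rules paraphrased from the text.
    Only the sentence-position rules are sketched; the "judge the size"
    and numbers-around cases are omitted for brevity."""
    i = text.find(symbol)
    if i < 0:
        return text
    before = text[:i].rstrip()
    after = text[i + len(symbol):].lstrip()
    if not before:                              # sentence-initial -> "who"
        word = "who"
    elif after.startswith("in"):                # followed by "in" -> "parentheses"
        word = "parentheses"
    elif not after or after[0] in ".?!":        # sentence-final
        if before.endswith((".", "?", "!")):    # preceded by a terminator -> empty
            word = ""
        elif before[-1:].isdigit():             # preceded by a number -> "how much"
            word = "how much"
        else:
            word = symbol                       # no rule matched; leave for the model
    else:
        word = symbol
    return text[:i] + word + text[i + len(symbol):]

print(convert_blank_by_position("() went to school"))  # who went to school
```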
Optionally, the symbol to be recognized includes a blank symbol, where the blank symbol includes "[symbol image 2]" and/or "_"; the apparatus further includes:
the classification module is used for inputting the text information to be converted into a trained deep neural network model and classifying the text information to be converted to obtain classification information; wherein the classification information includes: things, time, orientation, quantity;
the third conversion module is used for converting the symbol to be recognized into text information according to the classification information;
the trained deep neural network model is obtained by training in the following way:
obtaining sample text information and a classification label corresponding to the sample text information;
and training a deep neural network model by using the sample text information and the classification labels to obtain the trained deep neural network model.
Optionally, the classification module comprises:
the replacing unit is used for replacing the blank symbols in the text information to be converted with masks;
the acquiring unit is used for acquiring the context information of the text information to be converted;
and the classification unit is used for classifying the text information to be converted through an attention mechanism to obtain the classification information.
Optionally, the classification module is further configured to:
convert the blank symbol into "what time" when the classification information includes time;
convert the blank symbol into "which side" when the classification information includes orientation;
convert the blank symbol into "how many" when the classification information includes quantity.
Optionally, when the classification information includes thing, the third conversion module is further configured to:
detect whether the subject of the text information to be converted is animate; convert the blank symbol into "who" when the subject is animate; convert the blank symbol into "which" when the subject is inanimate; or,
convert the blank symbol into "why" when the text information to be converted does not contain a subject. The speech synthesis apparatus in this embodiment is presented in the form of functional units, where a unit refers to an ASIC circuit, a processor and memory executing one or more pieces of software or firmware, and/or other devices that can provide the above-described functionality.
The embodiments provided by the disclosure can perform complete and accurate speech synthesis on educational-scene questions. Combining the regular matching method with the deep learning method ensures the accuracy of the speech synthesis result, and introducing the attention transition mechanism makes long-text synthesis possible. With the speech synthesis method provided by the disclosure, a user can acquire all the information of a text without reading it, saving reading time. In educational applications, this can further improve learning efficiency.
The complete steps 700 of the exemplary embodiment of the present disclosure in the application of the educational scenario can refer to fig. 7, which includes:
step S701, inputting a question text.
Step S702, inputting the question type text into a text analysis module.
And step S703, predicting the symbol through the text analysis module, and predicting the meaning corresponding to the symbol in the question text.
Step S704, predicting the blank symbols in the question text, such as "_" and "[symbol image 2]".
Step S705, a complete semantic text is obtained according to the prediction result.
Step S706, the complete semantic text is input into a speech synthesis front-end module to perform word-sound conversion, so as to obtain a phoneme sequence.
In step S707, the front-end module inputs the phoneme sequence into the encoder, and encodes the output of the front-end module through the encoder.
In step S708, the attention module extracts information in the phoneme sequence in parallel.
In step S709, the attention transition module controls the attention to advance or stop at each time step of the encoder according to the output of the attention module.
In step S710, the decoder decodes the output of the encoder according to the output of the attention module.
In step S711, the decoder outputs a mel spectrum result.
In step S712, the mel spectrum output by the decoder is input into the vocoder module.
In step S713, the vocoder generates audio from the mel spectrum.
The embodiment provides an exemplary complete process of applying the speech synthesis method disclosed by the disclosure in an educational scene, and the steps described in the embodiment can realize a speech synthesis function for a topic-type text in the educational scene.
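Steps S706 through S713 describe a standard text-to-speech pipeline (phonemes, encoder, attention-guided decoder, mel spectrum, vocoder). A skeleton with dummy stand-in components, none of which is the patent's actual model, might look like:

```python
import numpy as np

class TTSPipeline:
    """Skeleton of steps S706-S713: text -> phonemes -> encoder ->
    decoder -> mel spectrum -> vocoder -> audio. Every component here
    is a toy placeholder for the corresponding module."""
    def text_to_phonemes(self, text):
        return list(text.replace(" ", ""))      # placeholder grapheme-to-phoneme

    def encode(self, phonemes):
        # fake phoneme embeddings: one 8-dim one-hot row per phoneme
        return np.eye(8)[[hash(p) % 8 for p in phonemes]]

    def decode_to_mel(self, states, n_mels=80):
        # one mel frame per encoder state; a real decoder is autoregressive
        return np.repeat(states, 10, axis=1)[:, :n_mels]

    def vocode(self, mel):
        return np.sin(np.cumsum(mel.sum(axis=1)))  # toy waveform

    def synthesize(self, text):
        mel = self.decode_to_mel(self.encode(self.text_to_phonemes(text)))
        return self.vocode(mel)

audio = TTSPipeline().synthesize("who went to school")
print(audio.shape)  # one sample per phoneme in this toy version
```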
An exemplary embodiment of the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor, the computer program, when executed by the at least one processor, is for causing the electronic device to perform a method according to an embodiment of the disclosure.
The disclosed exemplary embodiments also provide a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
The exemplary embodiments of the present disclosure also provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
Referring to fig. 8, a block diagram of an electronic device 800, which may be a server or a client of the present disclosure and is an example of a hardware device applicable to aspects of the present disclosure, will now be described. The term electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant as examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, an output unit 807, a storage unit 808, and a communication unit 809. The input unit 806 may be any type of device capable of inputting information to the electronic device 800; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device. The output unit 807 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 808 may include, but is not limited to, a magnetic disk or an optical disk. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth (TM) devices, WiFi devices, WiMax devices, cellular communication devices, and the like.
The computing unit 801 may be any of a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 801 executes the respective methods and processes described above. For example, in some embodiments, method 100 and/or method 200 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. In some embodiments, the computing unit 801 may be configured to perform the method 100 and/or the method 200 by any other suitable means (e.g., by means of firmware).
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Claims (11)

1. A method of speech synthesis comprising:
acquiring text information to be converted; the text information to be converted comprises a symbol to be identified;
acquiring a preset regular matching rule;
converting the symbol to be recognized into text information according to the preset regular matching rule;
converting the text information to be converted into complete text information according to the text information corresponding to the symbol to be recognized;
and carrying out voice synthesis on the complete text information to generate audio information.
2. The speech synthesis method according to claim 1, wherein converting the symbol to be recognized into text information according to the preset regular matching rule comprises:
when the symbol to be recognized includes "[symbol image 1]", converting "[symbol image 1]" into a question mark; or,
when the symbol to be recognized includes "[symbol image 1]", converting "[symbol image 1]" to empty.
3. The speech synthesis method according to claim 1, wherein, when the symbol to be recognized includes "_", converting the symbol to be recognized into text information according to the preset regular matching rule includes:
converting "_" to empty upon detecting that the portion containing "_" has the form "first part _ second part"; or,
converting "_" into "horizontal line" when the text information to be converted mentions the horizontal line.
4. The speech synthesis method according to claim 1, wherein converting the symbol to be recognized into text information according to the preset regular matching rule comprises:
when the symbol to be recognized includes "()", converting "()" into "parentheses"; or,
when the symbol to be recognized includes "[symbol image 2]", detecting the position of "[symbol image 2]" in the text information to be converted; when the position indicates that "[symbol image 2]" is at the beginning of a sentence, converting "[symbol image 2]" into "who"; when the position indicates that "[symbol image 2]" is within a sentence and is followed by "in", converting "[symbol image 2]" into "parentheses"; when the position indicates that "[symbol image 2]" is at the end of a sentence and is preceded by a terminator, converting "[symbol image 2]" to empty; when the position indicates that "[symbol image 2]" is at the end of a sentence and is preceded by a number, converting "[symbol image 2]" into "how much"; or,
when the symbol to be recognized includes "[symbol image 2]", the text information to be converted includes "judge the size", and "[symbol image 2]" is preceded and followed by numbers, converting "[symbol image 2]" into "greater than or less than";
when the symbol to be recognized includes "[symbol image 2]" and "[symbol image 2]" is preceded and followed by numbers, converting "[symbol image 2]" into "why".
5. The speech synthesis method according to claim 1, wherein the symbol to be recognized comprises a blank symbol; wherein the blank symbol comprises "[symbol image 2]" and/or "_", the method further comprising:
inputting the text information to be converted into a trained deep neural network model, and classifying the text information to be converted to obtain classified information; wherein the classification information includes: things, time, orientation, quantity;
converting the symbol to be recognized into text information according to the classification information;
the trained deep neural network model is obtained by training in the following way:
obtaining sample text information and a classification label corresponding to the sample text information;
and training a deep neural network model by using the sample text information and the classification labels to obtain the trained deep neural network model.
6. The speech synthesis method according to claim 5, wherein classifying the text information to be converted comprises:
replacing the blank symbols in the text information to be converted with masks;
acquiring context information of the text information to be converted;
and classifying the text information to be converted through an attention mechanism to obtain the classification information.
7. The speech synthesis method according to claim 5, wherein converting the symbol to be recognized into text information according to the classification information comprises:
converting the blank symbol into "what time" when the classification information includes time;
converting the blank symbol into "which side" when the classification information includes orientation;
converting the blank symbol into "how many" when the classification information includes quantity.
8. The speech synthesis method according to claim 5, wherein, when the classification information includes things, converting the symbol to be recognized into text information according to the classification information includes:
detecting whether the subject of the text information to be converted is animate; converting the blank symbol into "who" when the subject is animate; converting the blank symbol into "which" when the subject is inanimate; or,
converting the blank symbol into "why" when the text information to be converted does not contain a subject.
9. A speech synthesis apparatus comprising:
the first acquisition module is used for acquiring text information to be converted; the text information to be converted comprises a symbol to be identified;
the second acquisition module is used for acquiring a preset regular matching rule;
the first conversion module is used for converting the symbol to be recognized into text information according to the preset regular matching rule;
the second conversion module is used for converting the text information to be converted into complete text information according to the text information corresponding to the symbol to be recognized;
and the voice synthesis module is used for carrying out voice synthesis on the complete text information to generate audio information.
10. An electronic device, comprising:
a processor; and
a memory for storing a program, wherein the program is stored in the memory,
wherein the program comprises instructions which, when executed by the processor, cause the processor to carry out the method according to any one of claims 1-8.
11. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.
CN202110893747.4A 2021-08-05 2021-08-05 Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium Active CN113345409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110893747.4A CN113345409B (en) 2021-08-05 2021-08-05 Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN113345409A true CN113345409A (en) 2021-09-03
CN113345409B CN113345409B (en) 2021-11-26

Family

ID=77480695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110893747.4A Active CN113345409B (en) 2021-08-05 2021-08-05 Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN113345409B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103761963A (en) * 2014-02-18 2014-04-30 大陆汽车投资(上海)有限公司 Method for processing text containing emotion information
CN107729310A (en) * 2016-08-11 2018-02-23 中兴通讯股份有限公司 A kind of extracting method of text message, device and mobile terminal
JP2018169434A (en) * 2017-03-29 2018-11-01 富士通株式会社 Voice synthesizer, voice synthesis method, voice synthesis system and computer program for voice synthesis
CN109326279A (en) * 2018-11-23 2019-02-12 北京羽扇智信息科技有限公司 A kind of method, apparatus of text-to-speech, electronic equipment and storage medium
CN109918658A (en) * 2019-02-28 2019-06-21 云孚科技(北京)有限公司 A kind of method and system obtaining target vocabulary from text
CN111104517A (en) * 2019-10-01 2020-05-05 浙江工商大学 Chinese problem generation method based on two triplets
CN112633004A (en) * 2020-11-04 2021-04-09 北京字跳网络技术有限公司 Text punctuation deletion method and device, electronic equipment and storage medium
CN112687258A (en) * 2021-03-11 2021-04-20 北京世纪好未来教育科技有限公司 Speech synthesis method, apparatus and computer storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113724686A (en) * 2021-11-03 2021-11-30 中国科学院自动化研究所 Method and device for editing audio, electronic equipment and storage medium
US11462207B1 (en) 2021-11-03 2022-10-04 Institute Of Automation, Chinese Academy Of Sciences Method and apparatus for editing audio, electronic device and storage medium

Also Published As

Publication number Publication date
CN113345409B (en) 2021-11-26

Similar Documents

Publication Publication Date Title
CN111625634B (en) Word slot recognition method and device, computer readable storage medium and electronic equipment
CN109660865B (en) Method and device for automatically labeling videos, medium and electronic equipment
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN111310447A (en) Grammar error correction method, grammar error correction device, electronic equipment and storage medium
CN111027291B (en) Method and device for adding mark symbols in text and method and device for training model, and electronic equipment
CN114021582B (en) Spoken language understanding method, device, equipment and storage medium combined with voice information
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN113345409B (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
US20230034414A1 (en) Dialogue processing apparatus, learning apparatus, dialogue processing method, learning method and program
CN113743101B (en) Text error correction method, apparatus, electronic device and computer storage medium
CN113918031A (en) System and method for Chinese punctuation recovery using sub-character information
CN113761883A (en) Text information identification method and device, electronic equipment and storage medium
CN110516125B (en) Method, device and equipment for identifying abnormal character string and readable storage medium
US20230153550A1 (en) Machine Translation Method and Apparatus, Device and Storage Medium
CN116595023A (en) Address information updating method and device, electronic equipment and storage medium
CN111126059A (en) Method and device for generating short text and readable storage medium
CN114611529B (en) Intention recognition method and device, electronic equipment and storage medium
CN116432705A (en) Text generation model construction method, text generation device, equipment and medium
CN113723367B (en) Answer determining method, question judging method and device and electronic equipment
CN115631502A (en) Character recognition method, character recognition device, model training method, electronic device and medium
CN113342981A (en) Demand document classification method and device based on machine learning
CN113420121A (en) Text processing model training method, voice text processing method and device
CN111160042B (en) Text semantic analysis method and device
CN113722496B (en) Triple extraction method and device, readable storage medium and electronic equipment
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant