WO2022156654A1 - A text data processing method and apparatus
- Publication number: WO2022156654A1 (application PCT/CN2022/072441)
- Authority: WIPO (PCT)
- Prior art keywords: output, phoneme, target, hidden layer, audio
Classifications
- G10L13/02: Methods for producing synthetic speech; speech synthesisers
- G10L13/047: Architecture of speech synthesisers
- G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10: Prosody rules derived from text; stress or intonation
- G10L2013/105: Duration
- G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
- G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
- G06N3/044: Recurrent networks, e.g. Hopfield networks
- G06N3/045: Combinations of networks
- G06N3/0464: Convolutional networks [CNN, ConvNet]
- G06N3/084: Backpropagation, e.g. using gradient descent
- G06N3/096: Transfer learning
Definitions
- the present application relates to the field of artificial intelligence, and in particular, to a text data processing method and device.
- Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
- In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence.
- Research in artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
- In an existing implementation, the autoregressive processing of audio features both within and between the phonemes of a text is implemented through a recurrent neural network (RNN).
- In this autoregressive processing, the output of the hidden layer for the previous frame is fed back as an input when the speech data of the current frame is predicted, so the audio synthesis speed of the RNN is slow.
- In a first aspect, the present application provides a text data processing method, including: acquiring a target text;
- the phonemes of the target text include an adjacent first phoneme and second phoneme;
- a phone can also be called a phoneme, which is the smallest phonetic unit divided according to the natural properties of speech. Analyzed by the articulatory actions within a syllable, one action constitutes one phoneme. Phonemes are divided into vowels and consonants. For example, the Chinese syllable a (first tone, as in "ah") has only one phoneme, ai (fourth tone, as in "love") has two phonemes, and dai (first tone, as in "dumb") has three phonemes, and so on;
- the target text can be preprocessed, that is, processed into a sequence that matches the input format of the TTS model.
- Specifically, the server can perform text normalization on the target text to convert non-standard text into a pronounceable form; perform word segmentation to split the sentences of the target text into words and resolve sentence ambiguity; perform prosody analysis to predict the pause rhythm and/or stress of each sentence in the target text; convert the words of the target text to the phoneme level to obtain a phoneme string (that is, the phonemes of the target text); and convert the phoneme string into the sequence format required by the TTS model (referred to as an ID sequence in subsequent embodiments);
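- As an illustration only, the following minimal sketch shows one way a phoneme string could be mapped to such an ID sequence; the phoneme inventory, the special symbols, and the end-of-sequence handling are assumptions made for this example and are not taken from the embodiment.

```python
# Hypothetical mapping from a phoneme string to the integer ID sequence fed to a TTS model.
# The phoneme inventory and the special symbols below are assumptions for illustration only.
PHONEME_INVENTORY = ["<pad>", "<eos>", "a", "i", "d", "sil"]
PHONEME_TO_ID = {p: i for i, p in enumerate(PHONEME_INVENTORY)}

def phonemes_to_ids(phonemes):
    """Convert a list of phoneme symbols into the ID sequence expected by the TTS model."""
    return [PHONEME_TO_ID[p] for p in phonemes] + [PHONEME_TO_ID["<eos>"]]

# Example: the syllable "dai", which the text above splits into three phonemes.
print(phonemes_to_ids(["d", "a", "i"]))  # [4, 2, 3, 1]
```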
- the phonemes of the target text may include adjacent first phonemes and second phonemes.
- the phonemes of the target text form a phoneme sequence in which a plurality of phonemes are arranged in a specific order, and the first phoneme and the second phoneme may be any two adjacent phonemes in this phoneme sequence.
- An encoder (such as a convolutional neural network (CNN), a recurrent neural network (RNN), a network structure such as a transformer, or a hybrid network structure) can be used to perform feature extraction on the phonemes of the target text.
- the serial structure of the encoder may include, but is not limited to, a look-up table (LUT) embedding layer with dimension 512, 3 convolutional layers each with 512 filters of kernel size 5, and 1 bidirectional recurrent neural network layer with 512 hidden units.
- the encoder can be used to convert the phonemes of the target text into a hidden representation sequence (also called feature vectors); that is, the phonemes of the target text are mapped to an intermediate hidden representation H, and one feature vector is generated for each phoneme.
- the feature vector contains rich phoneme contextual information.
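- The following PyTorch sketch illustrates an encoder with the serial structure described above (a 512-dimensional look-up table, 3 convolutional layers with 512 filters of kernel size 5, and 1 bidirectional recurrent layer). Details such as the LSTM cell type, the 256-units-per-direction split, and the use of batch normalization are assumptions for illustration, not the exact structure of the embodiment.

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Minimal sketch of a phoneme encoder: LUT embedding -> 3 conv layers -> 1 bi-RNN layer."""
    def __init__(self, num_phonemes: int, dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(num_phonemes, dim)          # look-up table (LUT) layer
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=5, padding=2),    # 512 filters, kernel size 5
                nn.BatchNorm1d(dim),
                nn.ReLU(),
            )
            for _ in range(3)
        ])
        # One bidirectional recurrent layer; 256 units per direction give a 512-dim output.
        self.rnn = nn.LSTM(dim, dim // 2, num_layers=1, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # phoneme_ids: (batch, num_phonemes) -> hidden representation H: (batch, num_phonemes, 512)
        x = self.embedding(phoneme_ids).transpose(1, 2)           # (batch, 512, num_phonemes)
        for conv in self.convs:
            x = conv(x)
        h, _ = self.rnn(x.transpose(1, 2))                        # one feature vector per phoneme
        return h
```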
- prosody prediction can be performed on the feature vectors obtained by the encoder to obtain audio features.
- the so-called parallel execution means that, while the target RNN is calculating the first voice data according to the first audio feature, the target RNN is also calculating the second voice data according to the second audio feature.
- In one case, the target RNN starts the calculation of the second voice data only after obtaining the output of the hidden layer in the process of calculating the first voice data. Taking the first target audio feature as the audio feature of the last frame in the first audio feature and the second target audio feature as the audio feature of the first frame in the second audio feature as an example, in this case the process of the target RNN calculating the voice data may include:
- the hidden layer starts to process the first target audio feature, the hidden layer calculates to obtain the output of the first sub-hidden layer, the output layer starts to process the output of the first sub-hidden layer, and the output layer calculates to obtain the voice data; the hidden layer then starts to process the second target audio feature, the hidden layer calculates to obtain the output of the second sub-hidden layer, the output layer starts to process the output of the second sub-hidden layer, and the output layer calculates to obtain the voice data;
- here, the time at which the hidden layer starts to process the second target audio feature may fall after the output layer starts to process the output of the first sub-hidden layer and before the output layer finishes calculating the voice data; that is to say, the time during which the target RNN calculates the second voice data may overlap with the time during which the first voice data is calculated.
- However, the time overlap in the above situation is not considered to mean that the target recurrent neural network RNN is determining, in parallel, the first voice data corresponding to the first phoneme and the second voice data corresponding to the second phoneme.
- Taking as an example the case where the target RNN includes a hidden layer and an output layer, the first audio feature and the second audio feature each comprise the audio features of multiple frames, the first target audio feature is the audio feature of the last frame in the first audio feature, and the second target audio feature is the audio feature of the first frame in the second audio feature, the process by which the target RNN calculates the voice data may include:
- the hidden layer starts to process the first target audio feature, the hidden layer calculates to obtain the output of the first sub-hidden layer, the output layer starts to process the output of the first sub-hidden layer, and the output layer calculates to obtain the voice data;
- the hidden layer starts to process the second target audio feature, the hidden layer calculates to obtain the output of the second sub-hidden layer, the output layer starts to process the output of the second sub-hidden layer, and the output layer calculates to obtain the voice data;
- the so-called "parallel" means that the hidden layer of the target RNN starts to process the second target audio feature before the hidden layer has finished calculating the output of the first sub-hidden layer.
- In other words, the time at which the hidden layer of the target RNN starts to process the second target audio feature does not depend on the hidden layer having completed the calculation of the output of the first sub-hidden layer, but only on when the second target audio feature is acquired; once the second target audio feature is acquired, the hidden layer of the target RNN can directly start processing it;
- In addition, the time during which the target RNN processes the second audio feature and the time during which it processes the first audio feature should have a certain overlap, so as to exclude the case where the hidden layer of the target RNN starts to process the second target audio feature so early that the target RNN only starts to process the first audio feature after it has finished processing the second audio feature.
- In the existing implementation, the input of the hidden layer of the RNN includes not only the result of the input layer processing the audio feature of the current frame, but also the output of the hidden layer for the audio feature of the previous frame. Therefore, when the RNN processes the later of two adjacent frames that belong to different phonemes, it must wait until the hidden layer has processed the audio feature of the previous frame and produced its output before the audio feature of the current frame can be processed; that is, the input used to calculate the second voice data includes not only the second audio feature but also the output of the hidden layer produced in the process of calculating the first voice data, and only after that output is obtained can the calculation of the second voice data start, so the RNN takes a long time to process the audio features.
- In the embodiments of the present application, by contrast, the target RNN processes the first audio feature and the second audio feature in parallel; that is, the processing of the first audio feature and the processing of the second audio feature are decoupled, which reduces the time the target RNN needs to process the audio features.
- the audio corresponding to the first phoneme and the second phoneme is acquired through a vocoder.
- In this embodiment, for two adjacent frames that belong to different phonemes, the input of the hidden layer does not include the output of the hidden layer for the audio feature of the previous frame.
- Therefore, when the RNN processes the later of the two adjacent frames, it does not need to wait for the hidden layer to finish processing the audio feature of the previous frame and produce its output before processing the audio feature of the current frame; that is to say, the hidden layer can determine the output of the second sub-hidden layer according to the second audio feature before the output of the first sub-hidden layer is determined, thereby further reducing the time cost of the RNN processing the audio features.
- the target RNN includes a hidden layer and an output layer
- the target recurrent neural network RNN obtains the first voice data corresponding to the first phoneme according to the first audio feature
- the target RNN obtains the second voice data corresponding to the second phoneme according to the second audio feature, including:
- determining, by the hidden layer, the output of the first hidden layer according to the first audio feature; determining, by the output layer, the first voice data according to the output of the first hidden layer; determining, by the hidden layer, the output of the second hidden layer according to the second audio feature; and determining, by the output layer, the second voice data according to the output of the second hidden layer, where, in the process of the hidden layer determining the output of the second hidden layer, the output of the first hidden layer does not act as an input to the hidden layer.
- In this embodiment of the present application, the hidden layer may determine the output of the second sub-hidden layer according to the second target audio feature, and the output layer may determine the second sub-voice data according to the output of the second sub-hidden layer. In the existing implementation, by contrast, the hidden layer determines the output of the second sub-hidden layer according to both the second target audio feature and the output of the first sub-hidden layer.
- The difference is that, in this embodiment, in the process of the hidden layer determining the output of the second sub-hidden layer, the output of the first sub-hidden layer is not used as an input of the hidden layer; here the first target audio feature x_{t-1} and the second target audio feature x_t are the audio features of adjacent frames that belong to different phonemes.
- The result obtained by the input layer U of the RNN processing the second target audio feature x_t can be used as the input of the hidden layer of the RNN, whereas the hidden layer output s_{t-1}, which the hidden layer of the RNN obtained by processing the result of the input layer U processing the first target audio feature x_{t-1}, is not used as an input of the hidden layer of the RNN.
- In other words, for frames that belong to different phoneme units, the input of the hidden layer does not include the output of the hidden layer for the audio feature of the previous frame; that is, the autoregressive method is not used between different phonemes, thereby reducing the computing power and processing time required by the RNN to process the audio features.
- It should be noted that, when the target RNN processes the second target audio feature, the embodiment of the present application does not limit the input of the hidden layer of the target RNN to only the result obtained by the input layer of the RNN processing the second target audio feature.
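- Using the notation above, the two cases can be summarized compactly. The following is a sketch that assumes a simple Elman-style recurrence with input weights U, recurrent weights W, output weights V, and activation functions f and g; the embodiment itself is not limited to this exact form.

```latex
% Within a phoneme (autoregressive): the previous hidden output feeds the current step.
s_t = f(U x_t + W s_{t-1}), \qquad o_t = g(V s_t)

% Across a phoneme boundary (x_{t-1} and x_t belong to different phonemes):
% the previous hidden output s_{t-1} is not used as an input.
s_t = f(U x_t), \qquad o_t = g(V s_t)
```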
- In a possible implementation, the duration of the first phoneme is N frames, the number of first audio features is N, and each audio feature in the N first audio features corresponds to one of the N frames; the N first audio features include a first target audio feature and a third target audio feature, and the frame corresponding to the third target audio feature is the adjacent frame before the frame corresponding to the first target audio feature;
- the first voice data includes the first sub-voice data corresponding to the first target audio feature and the third sub-voice data corresponding to the third target audio feature;
- the determining, by the hidden layer, the output of the first hidden layer according to the first audio feature includes: determining, by the hidden layer, the output of the third sub-hidden layer according to the third target audio feature;
- the determining, by the output layer, the first speech data according to the output of the first hidden layer includes:
- the output layer determines the third sub-speech data according to the output of the third sub-hidden layer
- the output layer determines the first sub-speech data according to the output of the first sub-hidden layer.
- The hidden layer may determine the output of the third sub-hidden layer according to the third target audio feature. Specifically, the hidden layer may determine the output of the third sub-hidden layer according to the output obtained by the input layer of the RNN processing the third target audio feature, and the output layer is configured to determine the third sub-voice data according to the output of the third sub-hidden layer.
- the third sub-voice data may be a Mel (MEL) spectrum or a Bark spectrum.
- The hidden layer may determine the output of the first sub-hidden layer according to the first target audio feature and the output of the third sub-hidden layer, and the output layer may determine the first sub-voice data according to the output of the first sub-hidden layer.
- Within a phoneme, the input of the hidden layer of the RNN includes not only the output of the input layer processing the audio feature of the current frame, but also the output of the hidden layer for the audio feature of the previous frame; that is to say, each phoneme unit is processed internally in an autoregressive manner.
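- The following PyTorch sketch illustrates this decoding scheme: frames inside one phoneme are generated autoregressively, while no hidden state is carried across phoneme boundaries, so all phonemes can be decoded in parallel as a batch. The GRU cell, the linear output layer, and the padding of phoneme durations to a common length are assumptions for illustration, not the embodiment's exact structure.

```python
import torch
import torch.nn as nn

class PerPhonemeDecoder(nn.Module):
    """Sketch of a decoder that is autoregressive within a phoneme but not across phonemes."""
    def __init__(self, feat_dim: int, hidden_dim: int, mel_dim: int = 80):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim, hidden_dim)   # hidden layer
        self.out = nn.Linear(hidden_dim, mel_dim)      # output layer (predicts e.g. Mel frames)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (num_phonemes, frames_per_phoneme, feat_dim); phoneme durations are
        # assumed to be padded to a common length so that phonemes can be batched together.
        num_phonemes, num_frames, _ = audio_feats.shape
        # A fresh hidden state per phoneme: no output of the previous phoneme's hidden layer
        # is used as an input, which decouples the phonemes from one another.
        h = audio_feats.new_zeros(num_phonemes, self.cell.hidden_size)
        frames = []
        for t in range(num_frames):                    # autoregression only inside a phoneme
            h = self.cell(audio_feats[:, t, :], h)
            frames.append(self.out(h))
        return torch.stack(frames, dim=1)              # (num_phonemes, frames, mel_dim)
```

- Because no state crosses phoneme boundaries, the time loop runs over all phonemes at once, which is what allows the audio features of the first phoneme and the second phoneme to be processed in parallel.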
- the first audio feature includes at least one of the following information: fundamental frequency information or energy information of the first phoneme
- the second audio feature includes at least one of the following information: fundamental frequency information or energy information of the second phoneme.
- the first voice data and the second voice data are Mel (MEL) spectra or Bark spectra.
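- For reference only, the following sketch extracts the kinds of quantities named above (fundamental frequency, energy, and a Mel spectrogram) from a waveform with librosa. The file name, frame sizes, hop length, and Mel-band count are assumptions, and this is not the feature extraction of the embodiment, which predicts such features from phonemes with a neural network rather than computing them from recorded audio.

```python
import librosa
import numpy as np

# "example.wav" is a hypothetical input file used only for illustration.
y, sr = librosa.load("example.wav", sr=22050)

f0 = librosa.yin(y, fmin=librosa.note_to_hz("C2"),           # per-frame fundamental frequency
                 fmax=librosa.note_to_hz("C7"),
                 frame_length=1024, hop_length=256)
energy = librosa.feature.rms(y=y, frame_length=1024,         # per-frame energy (RMS)
                             hop_length=256)[0]
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, # 80-band Mel spectrogram
                                     hop_length=256, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)
print(f0.shape, energy.shape, mel_db.shape)
```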
- the target RNN is obtained by performing knowledge distillation on the student RNN according to the teacher RNN.
- the target RNN is obtained by performing knowledge distillation on the student RNN according to the teacher RNN and a first target loss; the first target loss indicates the difference between a first output and a second output; where:
- the first output is the output of the output layer of the teacher RNN
- the second output is the output of the output layer of the student RNN
- the first output is the output of the middle layer of the teacher RNN
- the second output is the output of the middle layer of the student RNN.
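- A minimal sketch of such a distillation step is shown below. It assumes the first target loss is a mean-squared error between the teacher output and the student output (here the output-layer outputs; middle-layer outputs could be compared in the same way), and the optimizer choice is likewise an assumption rather than part of the embodiment.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher_rnn, student_rnn, audio_feats, optimizer):
    """One knowledge-distillation step: pull the student output toward the teacher output."""
    with torch.no_grad():
        first_output = teacher_rnn(audio_feats)          # output of the teacher RNN (frozen)
    second_output = student_rnn(audio_feats)             # output of the student RNN
    first_target_loss = F.mse_loss(second_output, first_output)  # difference between the two
    optimizer.zero_grad()
    first_target_loss.backward()
    optimizer.step()
    return first_target_loss.item()
```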
- the performing feature extraction on the first phoneme and the second phoneme includes:
- the first phoneme and the second phoneme are processed through a target feature extraction network to obtain the first audio feature of the first phoneme and the second audio feature of the second phoneme;
- the target feature extraction network is obtained by performing knowledge distillation on the student feature extraction network according to the teacher feature extraction network and a second target loss; the second target loss indicates the difference between a third output and a fourth output;
- the third output is the output of the output layer of the teacher feature extraction network
- the fourth output is the output of the output layer of the student feature extraction network
- the third output is the output of the middle layer of the teacher feature extraction network
- the fourth output is the output of the middle layer of the student feature extraction network.
- the present application provides a text data processing device, including:
- an acquisition module for acquiring target text
- the phonemes of the target text include adjacent first phonemes and second phonemes
- a feature extraction module configured to perform feature extraction on the first phoneme and the second phoneme to obtain a first audio feature of the first phoneme and a second audio feature of the second phoneme;
- a voice data extraction module configured to obtain, through a target recurrent neural network RNN, the first voice data corresponding to the first phoneme according to the first audio feature, and to obtain, through the target RNN, the second voice data corresponding to the second phoneme according to the second audio feature; where the step of obtaining the first voice data corresponding to the first phoneme and the step of obtaining the second voice data corresponding to the second phoneme are performed in parallel;
- An audio extraction module configured to acquire audio corresponding to the first phoneme and the second phoneme through a vocoder according to the first voice data and the second voice data.
- the target RNN includes a hidden layer and an output layer
- the speech data extraction module is configured to determine the output of the first hidden layer according to the first audio feature through the hidden layer
- the second voice data is determined by the output layer according to the output of the second hidden layer, where, in the process of the hidden layer determining the output of the second hidden layer, the output of the first hidden layer does not act as an input to the hidden layer.
- the duration of the first phoneme is N frames
- the number of the first audio features is N
- each audio feature in the N first audio features corresponds to one frame of the N frames; the N first audio features include a first target audio feature and a third target audio feature;
- the frame corresponding to the third target audio feature is the adjacent frame before the frame corresponding to the first target audio feature;
- the first voice data includes the first sub-voice data corresponding to the first target audio feature and the third sub-voice data corresponding to the third target audio feature;
- the voice data extraction module is configured to determine, through the hidden layer, the output of the third sub-hidden layer according to the third target audio feature;
- the output layer determines the third sub-speech data according to the output of the third sub-hidden layer
- the output layer determines the first sub-speech data according to the output of the first sub-hidden layer.
- the first audio feature includes at least one of the following information: fundamental frequency information or energy information of the first phoneme
- the second audio feature includes at least one of the following information: fundamental frequency information or energy information of the second phoneme.
- the first voice data and the second voice data are Mel (MEL) spectra or Bark spectra.
- the target RNN is obtained by performing knowledge distillation on the student RNN according to the teacher RNN.
- the target RNN is obtained by performing knowledge distillation on the student RNN according to the teacher RNN and the first target loss; the first target loss indicates the difference between the first output and the second output; where:
- the first output is the output of the output layer of the teacher RNN
- the second output is the output of the output layer of the student RNN
- the first output is the output of the middle layer of the teacher RNN
- the second output is the output of the middle layer of the student RNN.
- the feature extraction module is configured to process the first phoneme and the second phoneme through a target feature extraction network to obtain the first audio feature of the first phoneme and the second audio feature of the second phoneme;
- the target feature extraction network is obtained by performing knowledge distillation on the student feature extraction network according to the teacher feature extraction network and the second target loss;
- the second target loss indicates the difference between the third output and the fourth output;
- the third output is the output of the output layer of the teacher feature extraction network
- the fourth output is the output of the output layer of the student feature extraction network
- the third output is the output of the middle layer of the teacher feature extraction network
- the fourth output is the output of the middle layer of the student feature extraction network.
- In another aspect, the present application provides a text data processing device, which may include a processor coupled with a memory, where the memory stores program instructions, and the method described in the above first aspect is implemented when the program instructions stored in the memory are executed by the processor.
- the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, which, when executed on a computer, causes the computer to execute the method described in the first aspect.
- the present application provides a circuit system, the circuit system comprising a processing circuit configured to perform the method of the above-mentioned first aspect.
- the present application provides a computer program product, comprising code, which, when the code is run on a computer, causes the computer to execute the method described in the first aspect.
- the present application provides a chip system
- the chip system includes a processor for implementing the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods.
- the chip system further includes a memory for storing necessary program instructions and data of the server or the communication device.
- the chip system may be composed of chips, or may include chips and other discrete devices.
- An embodiment of the present application provides a text data processing method, including: acquiring a target text, where the phonemes of the target text include an adjacent first phoneme and second phoneme; performing feature extraction on the first phoneme and the second phoneme to obtain a first audio feature of the first phoneme and a second audio feature of the second phoneme; obtaining, through a target recurrent neural network RNN, first voice data corresponding to the first phoneme according to the first audio feature, and obtaining, through the target RNN, second voice data corresponding to the second phoneme according to the second audio feature, where these two obtaining steps are performed in parallel; and acquiring, through a vocoder, the audio corresponding to the first phoneme and the second phoneme according to the first voice data and the second voice data.
- In this way, the target RNN can process the first audio feature and the second audio feature in parallel; that is, the processing of the first audio feature and the processing of the second audio feature are decoupled, which reduces the time the target RNN needs to process the audio features.
- FIG. 1 is a schematic structural diagram of an artificial intelligence main framework;
- FIG. 2 is a schematic diagram of a natural language processing system;
- FIG. 3a is a schematic diagram of a server provided by an embodiment of the present application;
- FIG. 3b is a schematic diagram of an electronic device provided by an embodiment of the present application;
- FIG. 4 is a schematic diagram of a text data processing method provided by an embodiment of the present application.
- FIG. 5 is a schematic diagram of a text data processing method provided by an embodiment of the present application.
- FIG. 6 is a schematic diagram of a text data processing method provided by an embodiment of the present application.
- FIG. 7 is a schematic diagram of a software architecture of a text processing method provided by an embodiment of the present application.
- FIG. 8 is a schematic diagram of a software architecture of a text processing method provided by an embodiment of the present application.
- FIG. 9 is a schematic diagram of a software architecture of a text processing method provided by an embodiment of the present application.
- FIG. 10 is a schematic diagram of a text processing apparatus provided by an embodiment of the present application.
- FIG. 11 is a schematic structural diagram of an execution device provided by an embodiment of the application.
- FIG. 12 is a schematic structural diagram of a training device provided by an embodiment of the present application.
- FIG. 13 is a schematic structural diagram of a chip provided by an embodiment of the present application.
- "At least one (item)" refers to one or more, and "a plurality" refers to two or more.
- "And/or" describes the relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, and both A and B exist, where A and B may be singular or plural.
- the character "/" generally indicates that the associated objects are in an "or" relationship.
- "At least one of the following item(s)" or similar expressions refer to any combination of these items, including any combination of a single item or plural items.
- For example, at least one of a, b, or c may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be singular or plural.
- Figure 1 shows a schematic structural diagram of the artificial intelligence main framework.
- The artificial intelligence theme framework is described below from two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
- the "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a refinement process of "data - information - knowledge - wisdom".
- the "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementations) to the industrial ecology of the system.
- The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and is supported by the basic platform. It communicates with the outside world through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the basic platform includes distributed computing frameworks, networks, and other related platform guarantees and support, and can include cloud storage and computing, interconnection networks, and so on. For example, sensors communicate with the outside to obtain data, and the data is provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
- the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
- the data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
- Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
- machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
- Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
- Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
- some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image identification, etc.
- Intelligent products and industry applications refer to the products and applications of the artificial intelligence system in various fields. They are the encapsulation of the overall artificial intelligence solution and the productization of intelligent information decision-making to realize practical applications. The application fields mainly include intelligent terminals, intelligent transportation, smart healthcare, autonomous driving, safe city, and the like.
- FIG. 2 shows an exemplary schematic structural diagram of a communication system.
- the communication system includes a server 200 and an electronic device 100.
- The communication system may include one or more servers, and one or more electronic devices may be included within the coverage range of each server, which is not limited in this application.
- the communication system may further include other network entities such as a network controller, a switching device, etc., and the present application is not limited thereto.
- the bidirectional arrows in FIG. 2 indicate that there is a communication connection between the server and the electronic device, that is, data transmission can be implemented between the server and the electronic device through a communication network.
- the above-mentioned communication network may be a local area network, or may be a wide area network transferred through a relay (relay) device, or includes a local area network and a wide area network.
- When the communication network is a local area network, the communication network may be a near field communication network such as a Wi-Fi hotspot network, a Wi-Fi P2P network, a Bluetooth network, a ZigBee network, or a near field communication (NFC) network.
- When the communication network is a wide area network, the communication network may be a third-generation mobile communication technology (3G) network, a fourth-generation mobile communication technology (4G) network, a fifth-generation mobile communication technology (5G) network, a future evolved public land mobile network (PLMN), the Internet, or the like, which is not limited in this application.
- In one implementation, the electronic device can obtain the target text input by the user, the electronic device can send the target text to the server side, the server can generate the audio corresponding to the target text according to the target text, and the server can send the audio to the electronic device.
- the electronic device may acquire the target text input by the user, and generate audio corresponding to the target text according to the target text.
- It should be understood that FIG. 2 schematically shows a communication system only for ease of understanding, and this should not constitute any limitation on the present application; the communication system may also include a greater number of servers and a greater number of electronic devices.
- the servers communicating with different electronic devices may be the same server or different servers, and the number of servers communicating with different electronic devices may be the same or different, which is not limited in this application.
- the server in the communication system can be any device with a transceiver function or a chip that can be provided in the device.
- FIG. 3a shows an exemplary schematic structural diagram of the server 200. For the structure of the server 200, reference may be made to the structure shown in FIG. 3a.
- the server includes at least one processor 201 , at least one memory 202 and at least one network interface 203 .
- the processor 201, the memory 202 and the network interface 203 are connected, for example, through a bus. In this application, the connection may include various interfaces, transmission lines, or buses, which are not limited in this embodiment.
- the network interface 203 is used to connect the server with other communication devices through a communication link, such as an Ethernet interface.
- the processor 201 is mainly used to process communication data, control the entire server, execute software programs, and process data of the software programs, for example, to support the server to perform the actions described in the embodiments.
- the processor 201 is mainly used to control the entire server, execute software programs, and process data of the software programs.
- a server may include multiple processors to enhance its processing capability, and various components of the server may be connected through various buses.
- the processor 201 may also be expressed as a processing circuit or a processor chip.
- the memory 202 is mainly used to store software programs and data.
- the memory 202 may exist independently and be connected to the processor 201 .
- the memory 202 may be integrated with the processor 201, for example, in one chip.
- the memory 202 can store program codes for executing the technical solutions of the present application, and is controlled and executed by the processor 201 .
- Figure 3a shows only one memory and one processor. In an actual server, there may be multiple processors and multiple memories.
- the memory may also be referred to as a storage medium or a storage device or the like.
- the memory may be a storage element on the same chip as the processor, that is, an on-chip storage element, or an independent storage element, which is not limited in this application.
- the electronic equipment in the communication system can also be called user equipment (UE), which can be deployed on land, including indoors or outdoors, handheld or vehicle-mounted; it can also be deployed on water (for example, on ships); and it can also be deployed in the air (for example, on airplanes, balloons, or satellites).
- Electronic devices can be mobile phones, tablet computers (pads), wearable devices with wireless communication functions (such as smart watches), location trackers with positioning functions, computers with wireless transceiver functions, virtual reality (VR) devices, augmented reality (AR) devices, wireless devices in a smart home, and the like, which are not limited in this application.
- the aforementioned electronic devices and chips that can be provided in the aforementioned electronic devices are collectively referred to as electronic devices.
- Electronic devices in this application may include, but are not limited to: smart mobile phones, TVs, tablet computers, wristbands, head-mounted displays (HMD), augmented reality (AR) devices, mixed reality (MR) devices, cellular phones, smart phones, personal digital assistants (PDA), in-vehicle electronics, laptop computers, personal computers (PC), monitoring equipment, robots, in-vehicle terminals, autonomous vehicles, and the like.
- In FIG. 3b, a specific structure is taken as an example to illustrate the structure of the electronic device provided in the present application.
- the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on.
- the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor, and the like.
- the structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the electronic device 100 .
- the electronic device 100 may include more or fewer components than shown, or combine some components, or split some components, or have a different arrangement of components.
- the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
- the processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), controller, video codec, digital signal processor (digital signal processor, DSP), baseband processor, and/or neural-network processing unit (neural-network processing unit, NPU), etc. Wherein, different processing units may be independent devices, or may be integrated in one or more processors.
- the controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
- a memory may also be provided in the processor 110 for storing instructions and data.
- the memory in processor 110 is cache memory. This memory may hold instructions or data that have just been used or recycled by the processor 110 . If the processor 110 needs to use the instruction or data again, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby increasing the efficiency of the system.
- the processor 110 may include one or more interfaces.
- the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
- the I2C interface is a bidirectional synchronous serial bus that includes a serial data line (SDA) and a serial clock line (SCL).
- the processor 110 may contain multiple sets of I2C buses.
- the processor 110 can be respectively coupled to the touch sensor 180K, the charger, the flash, the camera 193 and the like through different I2C bus interfaces.
- the processor 110 may couple the touch sensor 180K through the I2C interface, so that the processor 110 and the touch sensor 180K communicate with each other through the I2C bus interface, so as to realize the touch function of the electronic device 100 .
- the I2S interface can be used for audio communication.
- the processor 110 may contain multiple sets of I2S buses.
- the processor 110 may be coupled with the audio module 170 through an I2S bus to implement communication between the processor 110 and the audio module 170 .
- the audio module 170 can transmit audio signals to the wireless communication module 160 through the I2S interface, so as to realize the function of answering calls through a Bluetooth headset.
- the PCM interface can also be used for audio communications, sampling, quantizing and encoding analog signals.
- the audio module 170 and the wireless communication module 160 may be coupled through a PCM bus interface.
- the audio module 170 can also transmit audio signals to the wireless communication module 160 through the PCM interface, so as to realize the function of answering calls through the Bluetooth headset. Both the I2S interface and the PCM interface can be used for audio communication.
- the UART interface is a universal serial data bus used for asynchronous communication.
- the bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication.
- a UART interface is typically used to connect the processor 110 with the wireless communication module 160 .
- the processor 110 communicates with the Bluetooth module in the wireless communication module 160 through the UART interface to implement the Bluetooth function.
- the audio module 170 can transmit audio signals to the wireless communication module 160 through the UART interface, so as to realize the function of playing music through the Bluetooth headset.
- the MIPI interface can be used to connect the processor 110 with peripheral devices such as the display screen 194 and the camera 193 .
- MIPI interfaces include camera serial interface (CSI), display serial interface (DSI), etc.
- the processor 110 communicates with the camera 193 through a CSI interface, so as to realize the photographing function of the electronic device 100 .
- the processor 110 communicates with the display screen 194 through the DSI interface to implement the display function of the electronic device 100 .
- the GPIO interface can be configured by software.
- the GPIO interface can be configured as a control signal or as a data signal.
- the GPIO interface may be used to connect the processor 110 with the camera 193, the display screen 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like.
- the GPIO interface can also be configured as I2C interface, I2S interface, UART interface, MIPI interface, etc.
- the USB interface 130 is an interface that conforms to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, and the like.
- the USB interface 130 can be used to connect a charger to charge the electronic device 100, and can also be used to transmit data between the electronic device 100 and peripheral devices. It can also be used to connect headphones to play audio through the headphones.
- the interface can also be used to connect other electronic devices, such as AR devices.
- the interface connection relationship between the modules illustrated in the embodiment of the present invention is only a schematic illustration, and does not constitute a structural limitation of the electronic device 100 .
- the electronic device 100 may also adopt different interface connection manners in the foregoing embodiments, or a combination of multiple interface connection manners.
- the charging management module 140 is used to receive charging input from the charger.
- the charger may be a wireless charger or a wired charger.
- the charging management module 140 may receive charging input from the wired charger through the USB interface 130 .
- the charging management module 140 may receive wireless charging input through a wireless charging coil of the electronic device 100 . While the charging management module 140 charges the battery 142 , it can also supply power to the electronic device through the power management module 141 .
- the power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 .
- the power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the display screen 194, the camera 193, and the wireless communication module 160.
- the power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle times, battery health status (leakage, impedance).
- the power management module 141 may also be provided in the processor 110 .
- the power management module 141 and the charging management module 140 may also be provided in the same device.
- the wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.
- Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
- Each antenna in electronic device 100 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
- the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
- the mobile communication module 150 may provide wireless communication solutions including 2G/3G/4G/5G etc. applied on the electronic device 100 .
- the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA) and the like.
- the mobile communication module 150 can receive electromagnetic waves from the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modulation and demodulation processor for demodulation.
- the mobile communication module 150 can also amplify the signal modulated by the modulation and demodulation processor, and then turn it into an electromagnetic wave for radiation through the antenna 1 .
- at least part of the functional modules of the mobile communication module 150 may be provided in the processor 110 .
- at least part of the functional modules of the mobile communication module 150 may be provided in the same device as at least part of the modules of the processor 110 .
- the modem processor may include a modulator and a demodulator.
- the modulator is used to modulate the low frequency baseband signal to be sent into a medium and high frequency signal.
- the demodulator is used to demodulate the received electromagnetic wave signal into a low frequency baseband signal. Then the demodulator transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
- the low frequency baseband signal is processed by the baseband processor and passed to the application processor.
- the application processor outputs sound signals through audio devices (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or videos through the display screen 194 .
- the modem processor may be a stand-alone device.
- the modem processor may be independent of the processor 110, and may be provided in the same device as the mobile communication module 150 or other functional modules.
- the wireless communication module 160 can provide wireless communication solutions applied on the electronic device 100, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR) technology.
- the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
- the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
- the wireless communication module 160 can also receive the signal to be sent from the processor 110 , perform frequency modulation on it, amplify it, and convert it into electromagnetic waves for radiation through the antenna 2 .
- the antenna 1 of the electronic device 100 is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the electronic device 100 can communicate with the network and other devices through wireless communication technology.
- the wireless communication technology may include but is not limited to: the fifth generation mobile communication technology (5th-Generation, 5G) system, the global system for mobile communications (GSM), the general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division synchronous code division multiple access (TD-SCDMA), long term evolution (LTE), Bluetooth (BT), the global navigation satellite system (GNSS), wireless fidelity (Wi-Fi), near field communication (NFC), FM (also known as frequency modulation radio), Zigbee, radio frequency identification (RFID), and/or infrared (IR) technology.
- the GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite based augmentation systems (SBAS), etc.
- the electronic device 100 may also include a wired communication module (not shown in FIG. 1), or the mobile communication module 150 or the wireless communication module 160 here may be replaced with a wired communication module (not shown in FIG. 1); the wired communication module can enable the electronic device to communicate with other devices through a wired network.
- the wired network may include, but is not limited to, one or more of the following: optical transport network (OTN), synchronous digital hierarchy (SDH), passive optical network (PON), Ethernet network (Ethernet), or flexible Ethernet (flex Ethernet, FlexE).
- the electronic device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
- the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
- the GPU is used to perform mathematical and geometric calculations for graphics rendering.
- Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
- Display screen 194 is used to display images, videos, and the like.
- Display screen 194 includes a display panel.
- the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), and so on.
- the electronic device 100 may include one or N display screens 194 , where N is a positive integer greater than one.
- the electronic device 100 may implement a shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
- the ISP is used to process the data fed back by the camera 193 .
- when the shutter is opened, light is transmitted to the camera photosensitive element through the lens, and the optical signal is converted into an electrical signal; the camera photosensitive element transmits the electrical signal to the ISP for processing, and the ISP converts it into an image visible to the naked eye.
- ISP can also perform algorithm optimization on image noise, brightness, and skin tone.
- ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
- the ISP may be provided in the camera 193 .
- Camera 193 is used to capture still images or video.
- the object is projected through the lens to generate an optical image onto the photosensitive element.
- the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
- the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
- the ISP outputs the digital image signal to the DSP for processing.
- the DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV.
- the electronic device 100 may include 1 or N cameras 193 , where N is a positive integer greater than 1.
- the digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform a Fourier transform on the frequency point energy, and so on.
- Video codecs are used to compress or decompress digital video.
- the electronic device 100 may support one or more video codecs.
- the electronic device 100 can play or record videos of various encoding formats, such as: Moving Picture Experts Group (moving picture experts group, MPEG) 1, MPEG2, MPEG3, MPEG4 and so on.
- the NPU is a neural-network (NN) computing processor.
- Applications such as intelligent cognition of the electronic device 100 can be implemented through the NPU, such as image recognition, face recognition, speech recognition, text understanding, and the like.
- the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100 .
- the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function, for example, to save files such as music and videos in the external memory card.
- Internal memory 121 may be used to store computer executable program code, which includes instructions.
- the internal memory 121 may include a storage program area and a storage data area.
- the storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like.
- the storage data area may store data (such as audio data, phone book, etc.) created during the use of the electronic device 100 and the like.
- the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
- the processor 110 executes various functional applications and data processing of the electronic device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
- the electronic device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playback, recording, etc.
- the audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
- the speaker 170A, also referred to as a "horn", is used to convert audio electrical signals into sound signals.
- the electronic device 100 can listen to music through the speaker 170A, or listen to a hands-free call.
- the receiver 170B also referred to as "earpiece" is used to convert audio electrical signals into sound signals.
- the voice can be answered by placing the receiver 170B close to the human ear.
- the microphone 170C, also called a "mike" or a "mic", is used to convert sound signals into electrical signals.
- the user can make a sound near the microphone 170C through the human mouth, and input the sound signal into the microphone 170C.
- the electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
- the earphone jack 170D is used to connect wired earphones.
- the earphone interface 170D may be the USB interface 130, or may be a 3.5 mm open mobile terminal platform (OMTP) standard interface or a cellular telecommunications industry association of the USA (CTIA) standard interface.
- the pressure sensor 180A is used to sense pressure signals, and can convert the pressure signals into electrical signals.
- the pressure sensor 180A may be provided on the display screen 194 .
- the capacitive pressure sensor may be comprised of at least two parallel plates of conductive material. When a force is applied to the pressure sensor 180A, the capacitance between the electrodes changes.
- the electronic device 100 determines the intensity of the pressure according to the change in capacitance. When a touch operation acts on the display screen 194, the electronic device 100 detects the intensity of the touch operation according to the pressure sensor 180A.
- the electronic device 100 may also calculate the touched position according to the detection signal of the pressure sensor 180A.
- touch operations acting on the same touch position but with different touch operation intensities may correspond to different operation instructions. For example, when a touch operation whose intensity is less than the first pressure threshold acts on the short message application icon, the instruction for viewing the short message is executed. When a touch operation with a touch operation intensity greater than or equal to the first pressure threshold acts on the short message application icon, the instruction to create a new short message is executed.
- the gyro sensor 180B may be used to determine the motion attitude of the electronic device 100 .
- the angular velocity of electronic device 100 about three axes may be determined by gyro sensor 180B.
- the gyro sensor 180B can be used for image stabilization. Exemplarily, when the shutter is pressed, the gyro sensor 180B detects the shaking angle of the electronic device 100, calculates the distance that the lens module needs to compensate according to the angle, and allows the lens to offset the shaking of the electronic device 100 through reverse motion to achieve anti-shake.
- the gyro sensor 180B can also be used for navigation and somatosensory game scenarios.
- the air pressure sensor 180C is used to measure air pressure.
- the electronic device 100 calculates the altitude through the air pressure value measured by the air pressure sensor 180C to assist in positioning and navigation.
- the magnetic sensor 180D includes a Hall sensor.
- the electronic device 100 can detect the opening and closing of a flip holster using the magnetic sensor 180D, and can detect the opening and closing of a flip cover according to the magnetic sensor 180D, so that features such as automatic unlocking of the flip cover can be set according to the detected opening/closing state.
- the acceleration sensor 180E can detect the magnitude of the acceleration of the electronic device 100 in various directions (generally three axes). The magnitude and direction of gravity can be detected when the electronic device 100 is stationary.
- Distance sensor 180F for measuring distance.
- the electronic device 100 can measure the distance through infrared or laser. In some embodiments, when shooting a scene, the electronic device 100 can use the distance sensor 180F to measure the distance to achieve fast focusing.
- Proximity light sensor 180G may include, for example, light emitting diodes (LEDs) and light detectors, such as photodiodes.
- the light emitting diodes may be infrared light emitting diodes.
- the electronic device 100 emits infrared light to the outside through the light emitting diode.
- Electronic device 100 uses photodiodes to detect infrared reflected light from nearby objects.
- the electronic device 100 can use the proximity light sensor 180G to detect that the user holds the electronic device 100 close to the ear during a call, so as to automatically turn off the screen to save power. The proximity light sensor 180G can also be used in holster mode and pocket mode to automatically unlock and lock the screen.
- the ambient light sensor 180L is used to sense ambient light brightness. The electronic device 100 can adaptively adjust the brightness of the display screen 194 according to the perceived ambient light brightness. The ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures.
- the ambient light sensor 180L can also cooperate with the proximity light sensor 180G to detect whether the electronic device 100 is in a pocket, so as to prevent accidental touch.
- the fingerprint sensor 180H is used to collect fingerprints.
- the electronic device 100 can use the collected fingerprint characteristics to realize fingerprint unlocking, accessing application locks, taking pictures with fingerprints, answering incoming calls with fingerprints, and the like.
- the temperature sensor 180J is used to detect the temperature.
- the touch sensor 180K is also called a "touch device".
- the touch sensor 180K may be disposed on the display screen 194 , and the touch sensor 180K and the display screen 194 form a touch screen, also called a “touch screen”.
- the touch sensor 180K is used to detect a touch operation on or near it.
- the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
- Visual output related to touch operations may be provided through display screen 194 .
- the touch sensor 180K may also be disposed on the surface of the electronic device 100 , which is different from the location where the display screen 194 is located.
- the bone conduction sensor 180M can acquire vibration signals.
- the motion sensor 180N can be used to detect moving objects within the range captured by the camera, and collect the motion contours or motion trajectories of the moving objects.
- the motion sensor 180N may be an infrared sensor, a laser sensor, a dynamic vision sensor (DVS), etc.
- the DVS may specifically include sensors such as a DAVIS (Dynamic and Active-pixel Vision Sensor), an ATIS (Asynchronous Time-based Image Sensor), or a CeleX sensor.
- DVS draws on the properties of biological vision, where each pixel simulates a neuron that responds independently to relative changes in light intensity (hereafter referred to as "light intensity"). When the relative change in light intensity exceeds a threshold, the pixel outputs an event signal that includes the pixel's position, timestamp, and characteristic information about the light intensity.
- the keys 190 include a power-on key, a volume key, and the like. Keys 190 may be mechanical keys. It can also be a touch key.
- the electronic device 100 may receive key inputs and generate key signal inputs related to user settings and function control of the electronic device 100 .
- Motor 191 can generate vibrating cues.
- the indicator 192 can be an indicator light, which can be used to indicate the charging state, the change of the power, and can also be used to indicate a message, a missed call, a notification, and the like.
- the SIM card interface 195 is used to connect a SIM card.
- the SIM card can be contacted and separated from the electronic device 100 by inserting into the SIM card interface 195 or pulling out from the SIM card interface 195 .
- the electronic device 100 may support 1 or N SIM card interfaces, where N is a positive integer greater than 1.
- the SIM card interface 195 can support Nano SIM card, Micro SIM card, SIM card and so on. Multiple cards can be inserted into the same SIM card interface 195 at the same time. The types of the plurality of cards may be the same or different.
- the SIM card interface 195 can also be compatible with different types of SIM cards.
- the SIM card interface 195 is also compatible with external memory cards.
- the electronic device 100 interacts with the network through the SIM card to implement functions such as call and data communication.
- the electronic device 100 employs an eSIM, i.e., an embedded SIM card.
- the eSIM card can be embedded in the electronic device 100 and cannot be separated from the electronic device 100 .
- the electronic device 300 can be logically divided into a hardware layer, an operating system 311, and an application layer.
- the hardware layer includes the application processor 301, MCU 302, memory 303, modem 304, Wi-Fi module 306, sensor 308, positioning module 310 and other hardware resources as described above. This application does not impose any limitation on the type of operating system carried by the electronic device 300 .
- a neural network can be composed of neural units, and a neural unit can refer to an operation unit that takes x_s and an intercept 1 as inputs; the output of the operation unit can be: $h_{W,b}(x)=f(W^{T}x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$
- where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit.
- f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function can be used as the input of the next convolutional layer.
- the activation function can be a sigmoid function.
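- as an illustration only, the following is a minimal Python sketch of the neural unit described above, assuming a sigmoid activation and hypothetical input, weight, and bias values:

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation f(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(xs, ws, b):
    # Output of a single neural unit: f(sum_s(W_s * x_s) + b)
    return sigmoid(np.dot(ws, xs) + b)

# Hypothetical inputs, weights, and bias
xs = np.array([0.5, -1.2, 3.0])   # inputs x_s, s = 1..n
ws = np.array([0.8, 0.1, -0.4])   # weights W_s
b = 0.2                           # bias of the neural unit
print(neural_unit(xs, ws, b))
```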
- a neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
- the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
- the work of each layer in a neural network can be described by the mathematical expression y = a(W·x + b). From the physical level, the work of each layer in the neural network can be understood as the transformation from the input space to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors). These five operations include: 1. dimension raising/lowering; 2. enlarging/reducing; 3. rotation; 4. translation; 5. "bending". The operations 1, 2, and 3 are completed by W·x, the operation 4 is completed by +b, and the operation 5 is realized by a().
- W is the weight vector, and each value in the vector represents the weight value of a neuron in the neural network of this layer.
- This vector W determines the space transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed.
- the purpose of training the neural network is to finally obtain the weight matrix of all layers of the trained neural network (the weight matrix formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning the way to control the spatial transformation, and more specifically, learning the weight matrix.
- a deep neural network (DNN), also known as a multi-layer neural network, is a neural network whose layers can be divided into three categories: the input layer, the hidden layers, and the output layer.
- the first layer is the input layer
- the last layer is the output layer
- the middle layers are all hidden layers.
- the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
- the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as $W_{jk}^{L}$. It should be noted that the input layer does not have a W parameter.
- more hidden layers allow the network to better capture the complexities of the real world.
- a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks.
- Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vectors W of many layers).
- Convolutional Neural Network is a deep neural network with a convolutional structure.
- a convolutional neural network consists of a feature extractor consisting of convolutional and subsampling layers.
- the feature extractor can be viewed as a filter, and the convolution process can be viewed as convolution with an input image or a convolutional feature map using a trainable filter.
- the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
- in the convolutional layer, a neuron may be connected to only some of the neurons in the adjacent layers.
- a convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle.
- Neural units in the same feature plane share weights, and the shared weights here are convolution kernels.
- weight sharing can be understood as meaning that the way of extracting image information is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of other parts, which means that image information learned in one part can also be used in another part. Therefore, the same learned image information can be used for all positions on the image.
- multiple convolution kernels can be used to extract different image information. Generally, the more convolution kernels, the richer the image information reflected by the convolution operation.
- the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights by learning during the training process of the convolutional neural network.
- the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
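- the following is a minimal Python sketch, with a hypothetical image and kernel, illustrating how a single shared convolution kernel is applied at every position of the input to produce one feature plane:

```python
import numpy as np

def conv2d_shared_kernel(image, kernel):
    # The same (shared) kernel weights are applied at every position,
    # so the extracted feature does not depend on location in the image.
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)                 # hypothetical single-channel image
edge_kernel = np.array([[1, 0, -1],          # one convolution kernel (shared weights)
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)
feature_map = conv2d_shared_kernel(image, edge_kernel)
print(feature_map.shape)                     # (6, 6) feature plane produced by one kernel
```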
- recurrent neural networks (RNN): in a traditional neural network model, the layers are fully connected, while the nodes within each layer are unconnected. Although this ordinary neural network solves many problems, it is still powerless for many others. For example, to predict the next word of a sentence, the previous words generally need to be used, because the preceding and following words in a sentence are not independent. The reason why an RNN is called a recurrent neural network is that the current output of a sequence is also related to the previous output.
- RNN can process sequence data of any length.
- the training of RNN is the same as the training of traditional CNN or DNN.
- the error back-propagation algorithm is also used, but with one difference: if the RNN is expanded over time, the parameters, such as W, are shared, whereas this is not the case for the traditional neural network mentioned above.
- the output of each step depends not only on the network of the current step, but also on the state of the network in the previous steps.
- This learning algorithm is called Back propagation Through Time (BPTT), a time-based backpropagation algorithm.
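- as an illustrative sketch only (the conventional U, W, V notation below is an assumption, not taken from this application), a single step of a vanilla RNN in which the current output depends on the previous hidden state, with parameters shared across time steps, could look like this:

```python
import numpy as np

def rnn_step(x_t, s_prev, U, W, V, b, c):
    # The hidden state depends on the current input AND the previous hidden state,
    # which is what makes the network "recurrent":
    #   s_t = tanh(U @ x_t + W @ s_{t-1} + b)
    #   o_t = V @ s_t + c
    s_t = np.tanh(U @ x_t + W @ s_prev + b)
    o_t = V @ s_t + c
    return o_t, s_t

# Hypothetical sizes: 4-dim input, 8-dim hidden state, 3-dim output
U = np.random.randn(8, 4); W = np.random.randn(8, 8); V = np.random.randn(3, 8)
b = np.zeros(8); c = np.zeros(3)

s = np.zeros(8)                          # initial hidden state
for x in np.random.randn(5, 4):          # a sequence of 5 input frames
    o, s = rnn_step(x, s, U, W, V, b, c) # parameters U, W, V are shared across steps
```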
- the convolutional neural network can use the error back propagation (BP) algorithm to correct the size of the parameters in the initial super-resolution model during the training process, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller.
- the input signal is passed forward until the output will generate error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges.
- the back-propagation algorithm is a back-propagation motion dominated by the error loss, aiming to obtain the parameters of the optimal super-resolution model, such as the weight matrix.
- when a sound generator emits sound due to vibration, the sound can generally be decomposed into many simple sine waves; that is to say, all natural sounds are basically composed of many sine waves with different frequencies. The sine wave with the lowest frequency is the fundamental tone (that is, the fundamental frequency, represented by F0), while the other sine waves with higher frequencies are overtones.
- energy: also known as intensity or volume.
- prosody: in the field of speech synthesis, prosody generally refers to features that control functions such as intonation, pitch, stress emphasis, pause, and rhythm. Prosody can reflect the emotional state of the speaker, the form of speech, etc.
- a vocoder is a sound signal processing module or software that encodes acoustic features into sound waveforms.
- the method provided by the present application is described below from the training side of the neural network and the application side of the neural network.
- the neural network training method provided in the embodiments of the present application involves the processing of natural language data, and can be specifically applied to data processing methods such as data training, machine learning, and deep learning, which perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and so on, on the training data, to finally obtain a trained text processing model (including a feature extraction model and a recurrent neural network RNN); and the text data processing method provided in the embodiments of the present application can use the above-mentioned trained text processing model to input data (such as the target text in this application) into the trained text processing model to obtain output data (such as audio in this application).
- the text processing model training method and the text processing method provided in the embodiments of this application are inventions based on the same idea, and can also be understood as two parts in a system, or two stages of an overall process, such as a model training phase and a model application phase.
- FIG. 4 shows a text data processing method provided by an embodiment of the present application. As shown in FIG. 4, the text data processing method provided by this embodiment includes the following steps. Step 401: acquire the target text, where the phonemes of the target text include an adjacent first phoneme and second phoneme.
- the execution subject of step 401 may be an electronic device. Specifically, the user may input the target text that needs to be converted into audio on the electronic device, and correspondingly, the electronic device may obtain the target text.
- the execution body of step 401 may be the server. Specifically, the user may input the target text to be converted into audio on the electronic device, the electronic device may send the target text to the server, and accordingly, the server may obtain the target text.
- the electronic device may display a text input box and a text input indication, where the indication is used to instruct the user to input, in the text input box, the text that needs to be converted into audio, and the electronic device may obtain the target text entered by the user in the text input box.
- an application program that can generate the audio corresponding to the target text according to the target text may be installed on the electronic device; the user may open the relevant application program and enter the target text that needs to be converted into audio in the application program, and then the electronic device can generate the audio corresponding to the target text according to the target text, or send the target text to the server, and the server can generate the audio corresponding to the target text according to the target text.
- the target text may be processed according to a text-to-speech (TTS) model to obtain audio corresponding to the target text.
- TTS text-to-speech
- after the target text is obtained, the target text may be preprocessed and processed into a sequence that conforms to the input format of the TTS model.
- the server may perform text normalization on the target text to convert the irregular target text into a pronounceable format, perform word segmentation processing to segment the sentences in the target text according to words so as to resolve sentence ambiguity, perform prosody analysis to predict the pause rhythm and/or accent of each sentence in the target text, convert the words of the target text into phoneme-level units to obtain a phoneme string (that is, the phonemes of the target text), and convert the phoneme string into the sequence format required by the TTS model (referred to in subsequent embodiments as the ID sequence).
- a phone can also be referred to as a phoneme, which is the smallest phonetic unit divided according to the natural attributes of speech. According to the analysis of the pronunciation actions within a syllable, one action constitutes one phoneme. Phonemes are divided into vowels and consonants. For example, the Chinese syllable a (for example, in the first tone: "ah") has only one phoneme, ai (for example, in the fourth tone: "love") has two phonemes, dai (for example, in the first tone: "dull") has three phonemes, and so on.
- the target English text is "governments have made policy decisions"
- the phoneme of the target text is "G AH1V ER0 M AH0 N T HH AE1 V M EY1 D P AA1 L AH0 S IY0 D IH0 S IH1 ZH AH0 N Z ".
- the phonemes of the target Chinese text "what's the weather like today" are "j", "in", "t", "i", "an", and so on.
- the phonemes of the target text may include adjacent first phonemes and second phonemes.
- the phonemes of the target text form a phoneme sequence in which a plurality of phonemes are arranged in a specific order, and the first phoneme and the second phoneme may be any two adjacent phonemes in the above phoneme sequence.
- M phonemes of the target text can be obtained, and the M phonemes can be processed through a neural network to obtain M feature vectors.
- the target text can be converted into a serialized identity (ID) sequence, where each identifier in the ID sequence may correspond to one phoneme of the M phonemes; correspondingly, the ID sequence includes two adjacent identifiers, and the two adjacent identifiers correspond to the first phoneme and the second phoneme, respectively.
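- purely as an illustration, a minimal Python sketch of converting phonemes into an ID sequence might look as follows; the phoneme table and the phoneme set shown are hypothetical:

```python
# Hypothetical phoneme table; a real system would use a fixed phoneme inventory.
phoneme_to_id = {"G": 1, "AH1": 2, "V": 3, "ER0": 4, "M": 5, "AH0": 6, "N": 7, "T": 8}

def phonemes_to_ids(phonemes):
    # Each phoneme of the target text is mapped to one identifier, so adjacent
    # identifiers correspond to adjacent phonemes (e.g. the first and second phonemes).
    return [phoneme_to_id[p] for p in phonemes]

phonemes = ["G", "AH1", "V", "ER0", "M", "AH0", "N", "T"]  # phonemes of a (partial) target text
id_sequence = phonemes_to_ids(phonemes)
print(id_sequence)   # [1, 2, 3, 4, 5, 6, 7, 8]
```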
- the execution subject of step 402 may be an electronic device or a server.
- the electronic device may acquire the target text and send the target text to the server, and the server may perform feature extraction on the first phoneme and the second phoneme to obtain the first audio feature of the first phoneme and the second audio feature of the second phoneme; or, the electronic device may acquire the target text and perform feature extraction on the first phoneme and the second phoneme to obtain the first audio feature of the first phoneme and the second audio feature of the second phoneme; or, the server may obtain the target text and perform feature extraction on the first phoneme and the second phoneme to obtain the first audio feature of the first phoneme and the second audio feature of the second phoneme.
- feature extraction may be performed on the phonemes of the target text to obtain multiple audio features, wherein the multiple audio features include a first audio feature of the first phoneme and a second audio feature of the second phoneme.
- an encoder (for example, a network structure such as a convolutional neural network (CNN), a recurrent neural network (RNN), or a transformer, or the hybrid network structure shown in FIG. 5) may be used to perform the feature extraction.
- the serial structure of the encoder may include, but is not limited to, a LUP layer (with a dimension of 512), 3 convolution layers each with 512 convolution kernels of kernel size 5, and 1 bidirectional recurrent neural network layer with 512 hidden units.
- the encoder can be used to convert the phoneme of the target text into a hidden layer representation sequence (also called a feature vector), that is, the phoneme of the target text is mapped to the intermediate implicit representation H, and a feature vector will be generated for each phoneme.
- the feature vector contains rich phoneme contextual information.
- the encoder may encode the ID sequence corresponding to the phoneme of the target text into M feature vectors (or referred to as embedding vectors), wherein the feature vector may include abstract content information of the phoneme of the target text.
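- the following PyTorch-style sketch is one possible interpretation of such an encoder, assuming the layer sizes mentioned above (embedding dimension 512, 3 convolution layers with 512 kernels of size 5, and one bidirectional recurrent layer with a 512-dimensional hidden representation); it is illustrative only and not the exact network of this application:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Sketch of an encoder: embedding (dim 512), 3 conv layers
    # (512 kernels, kernel size 5), and one bidirectional LSTM layer.
    def __init__(self, n_phonemes, dim=512):
        super().__init__()
        self.embedding = nn.Embedding(n_phonemes, dim)
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv1d(dim, dim, kernel_size=5, padding=2),
                          nn.BatchNorm1d(dim), nn.ReLU())
            for _ in range(3)
        ])
        # Bidirectional LSTM with a 512-dim hidden representation (256 per direction)
        self.rnn = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids):            # phoneme_ids: (batch, M)
        x = self.embedding(phoneme_ids)        # (batch, M, 512)
        x = x.transpose(1, 2)                  # (batch, 512, M) for Conv1d
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                  # back to (batch, M, 512)
        h, _ = self.rnn(x)                     # one feature vector per phoneme
        return h                               # (batch, M, 512) hidden representation H

encoder = Encoder(n_phonemes=100)              # hypothetical phoneme inventory size
ids = torch.randint(0, 100, (1, 8))            # one ID sequence of 8 phonemes
features = encoder(ids)                        # M = 8 feature vectors
```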
- prosody prediction can be performed on the feature vectors obtained by the encoder, where the prosody prediction can include three parts: duration prediction, pitch prediction, and energy prediction, which are used to characterize the duration information, fundamental frequency information, and energy information of the phonemes, respectively.
- the three predictions described above can all be implemented using the same structure.
- the structure can include 2 convolution layers (with 384 3*1 convolution kernels) and a fully connected layer with 384 hidden units.
- in addition, an extra convolution layer (with 512 9*1 convolution kernels) may be included.
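- as an illustration of the predictor structure described above, the following PyTorch-style sketch uses two convolution layers (384 3*1 kernels) followed by a fully connected layer; the input dimension and the reuse of one class for duration, pitch, and energy prediction are assumptions:

```python
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    # Sketch: two convolution layers (384 kernels of size 3*1) followed by a
    # fully connected layer; the same structure could be reused for duration,
    # pitch, and energy prediction (only the interpretation of the output differs).
    def __init__(self, in_dim=512, hidden=384):
        super().__init__()
        self.conv1 = nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.fc = nn.Linear(hidden, 1)          # one predicted value per phoneme

    def forward(self, h):                       # h: (batch, M, in_dim) phoneme features
        x = torch.relu(self.conv1(h.transpose(1, 2)))
        x = torch.relu(self.conv2(x)).transpose(1, 2)
        return self.fc(x).squeeze(-1)           # (batch, M), e.g. duration in frames

duration_predictor = ProsodyPredictor()
h = torch.randn(1, 8, 512)                      # encoder output for 8 phonemes
durations = duration_predictor(h)
```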
- the duration information may refer to the number of frames corresponding to each phoneme.
- the so-called number of frames corresponding to each phoneme refers to how many frames each phoneme is presented in; for example, the first phoneme is used for presentation in N frames, and the second phoneme is used for presentation in M frames.
- upsampling can be performed on the feature vector H, fundamental frequency information F and energy information E of each phoneme to obtain the feature vector H, fundamental frequency information F and energy information E of each frame of each phoneme.
- the audio feature of each frame of each phoneme can be determined according to the feature vector H, the fundamental frequency information F, and the energy information E of each frame of each phoneme.
- the sum of the feature vector H, the fundamental frequency information F, and the energy information E of each frame of each phoneme can be used as the audio feature of that frame of the phoneme, thereby obtaining the first audio feature of the first phoneme and the second audio feature of the second phoneme.
- it should be noted that the above-mentioned audio features can be obtained according to at least one of the feature vector H, the fundamental frequency information F, and the energy information E of each frame, and the manner of obtaining them is not limited to summation; for example, it may be a weighted summation, or the audio features may be obtained through other mathematical operations or neural networks, which is not limited in this application.
- the first audio feature of the first phoneme and the second audio feature of the second phoneme may be obtained, where the first audio feature may include the audio features of each frame corresponding to the first phoneme, and the second audio feature may include the audio features of each frame corresponding to the second phoneme. Specifically, the first phoneme is used for presentation in N frames, the number of first audio features is N, and each audio feature in the N first audio features corresponds to one of the N frames; the second phoneme is used for presentation in M frames, the number of second audio features is M, and each audio feature in the M second audio features corresponds to one of the M frames.
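- a minimal Python sketch of the upsampling and combination described above is shown below, assuming simple summation of H, F, and E and toy dimensions; as noted above, a weighted sum or another operation could be used instead:

```python
import numpy as np

def upsample_and_combine(H, F, E, durations):
    # H: (M, d) phoneme feature vectors; F, E: (M,) fundamental frequency / energy
    # per phoneme; durations: (M,) number of frames per phoneme.
    # Each phoneme's features are repeated for all of its frames, then summed
    # (simple summation here; a weighted sum or another operation could be used).
    H_frames = np.repeat(H, durations, axis=0)      # (total_frames, d)
    F_frames = np.repeat(F, durations)[:, None]     # (total_frames, 1)
    E_frames = np.repeat(E, durations)[:, None]     # (total_frames, 1)
    return H_frames + F_frames + E_frames           # per-frame audio features

H = np.random.randn(2, 4)          # two phonemes, 4-dim feature vectors (toy sizes)
F = np.array([0.3, 0.7]); E = np.array([0.1, 0.2])
durations = np.array([3, 2])       # first phoneme: N = 3 frames, second phoneme: M = 2 frames
audio_features = upsample_and_combine(H, F, E, durations)
print(audio_features.shape)        # (5, 4): 3 first audio features + 2 second audio features
```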
- according to the first audio feature and the second audio feature, the first speech data corresponding to the first phoneme and the second speech data corresponding to the second phoneme are determined in parallel through the target recurrent neural network RNN.
- the execution subject of step 403 may be an electronic device or a server.
- the electronic device may obtain the target text and perform feature extraction on the first phoneme and the second phoneme to obtain the first audio feature of the first phoneme and the second audio feature of the second phoneme, and then, according to the first audio feature and the second audio feature, obtain the first voice data corresponding to the first phoneme and the second voice data corresponding to the second phoneme through the target recurrent neural network RNN; or, the electronic device may obtain the target text and perform feature extraction on the first phoneme and the second phoneme to obtain the first audio feature of the first phoneme and the second audio feature of the second phoneme, and send the first audio feature and the second audio feature to the server, and the server may obtain, according to the first audio feature and the second audio feature, the first voice data corresponding to the first phoneme and the second voice data corresponding to the second phoneme through the target recurrent neural network RNN; or, the electronic device may obtain the target text and send the target text to the server, and the server may perform feature extraction on the first phoneme and the second phoneme to obtain the first audio feature of the first phoneme and the second audio feature of the second phoneme, and obtain, according to the first audio feature and the second audio feature, the first voice data corresponding to the first phoneme and the second voice data corresponding to the second phoneme through the target recurrent neural network RNN.
- the first voice data and the second voice data may be a Mel spectrum (MEL) or a Bark spectrum.
- the pre-trained RNN may be used to process the first audio feature of the first phoneme and the second audio feature of the second phoneme to obtain the first voice data corresponding to the first phoneme and the second voice data corresponding to the second phoneme. More generally, the pre-trained RNN can be used to process the phonemes of the target text to obtain the voice data of each phoneme of the target text, where the voice data of each phoneme includes the first voice data corresponding to the first phoneme and the second voice data corresponding to the second phoneme.
- the process of processing the first audio feature of the first phoneme and the second audio feature of the second phoneme may also include other network structures other than RNN, which is not limited in this application.
- for example, the audio features of the phonemes can be processed using the network structure of Taco2, which sequentially includes an LSTM, a linear projection, and a pre-net.
- in the existing implementation, the input of the hidden layer of the RNN includes not only the output of the input layer processing the audio feature of the current frame, but also the output of the hidden layer processing the audio feature of the previous frame; that is, for each phoneme unit, the autoregressive method is used internally, and the autoregressive method is also used between different phonemes.
- for example, the first target audio feature x_{t-1} and the second target audio feature x_t are audio features of adjacent frames of different phonemes. In the existing implementation, when the RNN processes the second target audio feature x_t, the result obtained by the input layer processing the second target audio feature x_t is used as the input of the hidden layer of the RNN, and the hidden layer output s_{t-1}, obtained after the hidden layer of the RNN processes the result of the input layer processing the first target audio feature x_{t-1}, is also used as the input of the hidden layer of the RNN. Equivalently, for each phoneme unit, the autoregressive method is used internally, and the autoregressive method is also used between different phonemes. The use of the autoregressive method between different phonemes greatly increases the computing power and processing time required by the RNN to process the audio features.
- in order to reduce the computing power and processing time required by the RNN to process audio features, the target RNN can process the first audio feature and the second audio feature in parallel; that is, while the first voice data is being calculated from the first audio feature, the process of calculating the second voice data from the second audio feature is also in progress.
- the target RNN includes a hidden layer and an output layer, and the first audio feature and the second audio feature can each be audio features of multiple frames. Taking the case where the first target audio feature is the audio feature of the last frame of the first audio feature and the second target audio feature is the audio feature of the first frame of the second audio feature as an example, the process by which the target RNN calculates the voice data may include:
- the hidden layer starts to process the first target audio feature and calculates the output of the first sub-hidden layer; the output layer starts to process the output of the first sub-hidden layer and calculates the first sub-speech data;
- the hidden layer starts to process the second target audio feature and calculates the output of the second sub-hidden layer; the output layer starts to process the output of the second sub-hidden layer and calculates the second sub-speech data;
- the so-called parallel processing means that the hidden layer of the target RNN starts to process the second target audio feature before it has finished calculating the output of the first sub-hidden layer. In other words, the time at which the hidden layer of the target RNN starts to process the second target audio feature does not depend on the hidden layer completing the calculation of the output of the first sub-hidden layer, but depends on the acquisition time of the second target audio feature: once the second target audio feature is acquired, the hidden layer of the target RNN can directly start processing the second target audio feature.
- a non-autoregressive execution mode can be used between different phonemes.
- the duration of the first phoneme is N frames, the number of first audio features is N, each audio feature in the N first audio features corresponds to one of the N frames, the N first audio features include the first target audio feature, and the first target audio feature is the audio feature of the last frame among the N first audio features; the second phoneme is used for presentation in M frames, the number of second audio features is M, each audio feature in the M second audio features corresponds to one of the M frames, the M second audio features include the second target audio feature, and the second target audio feature is the audio feature of the first frame among the M second audio features. That is, the first target audio feature and the second target audio feature are audio features of adjacent frames of different phonemes.
- the N first audio features further include a third target audio feature
- the third target audio feature is the audio feature of the penultimate frame in the N first audio features, that is, the first target audio feature and the third target audio features are audio features of adjacent frames of the same phoneme.
- the hidden layer may determine the output of the third sub-hidden layer according to the third target audio feature. Specifically, the hidden layer may determine the output of the third sub-hidden layer according to the output obtained by the input layer of the RNN processing the third target audio feature, and the output layer may determine the third sub-speech data according to the output of the third sub-hidden layer.
- the third sub-voice data may be Mel spectrum MEL or Bark spectrum.
- the hidden layer may determine the output of the first sub-hidden layer according to the first target audio feature and the output of the third sub-hidden layer, and the output layer may determine the first sub-speech data according to the output of the first sub-hidden layer.
- the input of the hidden layer of the RNN includes not only the output of the input layer processing the audio features of the current frame, but also the output of the hidden layer processing the audio features of the previous frame, that is to say , for each phoneme unit, it is internally performed in an autoregressive manner.
- the hidden layer may further determine the output of the second sub-hidden layer according to the second target audio feature, and the output layer may determine the second sub-speech data according to the output of the second sub-hidden layer. In the existing implementation, the hidden layer determines the output of the second sub-hidden layer according to the second target audio feature and the output of the first sub-hidden layer. The difference is that, in this embodiment, in the process of the hidden layer determining the output of the second sub-hidden layer, the output of the first sub-hidden layer is not used as an input of the hidden layer.
- for example, the first target audio feature x_{t-1} and the second target audio feature x_t are the audio features of adjacent frames of different phonemes. When the RNN processes the second target audio feature x_t, the result obtained by the input layer U processing the second target audio feature x_t is used as the input of the hidden layer of the RNN, while the hidden layer output s_{t-1}, which is obtained after the hidden layer of the RNN processes the result of the input layer U processing the first target audio feature x_{t-1}, is not used as the input of the hidden layer of the RNN.
- in other words, between two adjacent frames belonging to different phonemes, the input of the hidden layer does not include the output of the hidden layer processing the audio feature of the previous frame; that is to say, the autoregressive method is not used between different phonemes, thereby reducing the computing power and processing time required by the RNN to process the audio features.
- it should be noted that the embodiment of the present application does not limit the input of the hidden layer of the RNN, when processing the second target audio feature, to only the result obtained by the input layer of the RNN processing the second target audio feature.
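- the following Python sketch illustrates the idea described above: within a phoneme the decoding is autoregressive, while the hidden-layer output is not carried across phoneme boundaries, so different phonemes can be decoded independently (and hence in parallel); the toy dimensions and the plain loop standing in for parallel execution are assumptions, not the exact decoder of this application:

```python
import numpy as np

def decode_phoneme(frames, U, W, V, b):
    # Within one phoneme the decoder is autoregressive: the hidden layer input
    # includes the previous frame's hidden-layer output s_{t-1}.
    s = np.zeros(W.shape[0])
    outputs = []
    for x in frames:                          # frames: (n_frames, input_dim)
        s = np.tanh(U @ x + W @ s + b)        # hidden layer output
        outputs.append(V @ s)                 # output layer -> per-frame voice data
    return np.stack(outputs)

def decode_in_parallel(phoneme_frames, U, W, V, b):
    # Across different phonemes the hidden state is NOT carried over, so each
    # phoneme's frames can be decoded independently (for example, in a thread
    # pool); a plain loop stands in for parallel execution here.
    results = [decode_phoneme(frames, U, W, V, b) for frames in phoneme_frames]
    return np.concatenate(results, axis=0)    # splice voice data in frame order

d_in, d_h, d_out = 4, 8, 3                    # toy dimensions
U = np.random.randn(d_h, d_in); W = np.random.randn(d_h, d_h)
V = np.random.randn(d_out, d_h); b = np.zeros(d_h)
first_phoneme = np.random.randn(3, d_in)      # N = 3 first audio features
second_phoneme = np.random.randn(2, d_in)     # M = 2 second audio features
voice_data = decode_in_parallel([first_phoneme, second_phoneme], U, W, V, b)
print(voice_data.shape)                       # (5, 3) frame-level voice data
```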
- after the voice data of each frame is obtained, the voice data can be spliced in frame order to obtain a voice data processing result, and the voice data processing result can also be compensated through a post-processing network (post-net).
- the hidden layer may determine the output of the second sub-hidden layer according to the second audio feature before determining the output of the first sub-hidden layer.
- in the existing implementation, the input of the hidden layer of the RNN includes not only the output of the input layer processing the audio features of the current frame, but also the output of the hidden layer processing the audio features of the previous frame. Therefore, when the RNN processes the audio features of the later frame of two adjacent frames belonging to different phonemes, it needs to wait for the hidden layer to finish processing the audio features of the previous frame and obtain the output of the hidden layer before the audio features of the current frame can be processed.
- in this embodiment, for two adjacent frames belonging to different phonemes, the input of the hidden layer does not include the output of the hidden layer processing the audio features of the previous frame. Therefore, when the RNN processes the audio features of the later frame of the two adjacent frames, it does not need to wait for the hidden layer to finish processing the audio features of the previous frame and obtain the output of the hidden layer before processing the audio features of the current frame. That is to say, the hidden layer can determine the output of the second sub-hidden layer according to the second audio feature before the output of the first sub-hidden layer is determined, thereby further reducing the time cost of the RNN processing the audio features.
- the execution subject of step 404 may be an electronic device or a server.
- the first voice data and the second voice data may be used as the input of the vocoder to output audio.
- the voice data can be used as the input of the vocoder, and then the audio of the target text is output, and the audio includes the audio corresponding to the first phoneme and the second phoneme.
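- as a simple illustration of the vocoder step, the following sketch converts a mel spectrogram into a waveform using librosa's Griffin-Lim-based mel inversion as a stand-in for a vocoder; a deployed system would typically use a dedicated (for example, neural) vocoder, and the sample rate, hop length, and mel spectrogram shown here are assumptions:

```python
import numpy as np
import librosa
import soundfile as sf

def vocode(mel_spectrogram, sr=22050, hop_length=256):
    # Convert frame-level voice data (a mel spectrogram) back to a waveform.
    # Griffin-Lim-based inversion is used here as a simple stand-in; a real
    # system would typically use a vocoder trained for this purpose.
    return librosa.feature.inverse.mel_to_audio(
        mel_spectrogram, sr=sr, hop_length=hop_length)

# Hypothetical mel spectrogram: 128 mel bins x 5 frames
# (first + second voice data spliced in frame order)
mel = np.abs(np.random.randn(128, 5)).astype(np.float32)
audio = vocode(mel)
sf.write("target_text.wav", audio, 22050)   # audio corresponding to the target text
```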
- an embodiment of the present application provides a text data processing method, including: acquiring a target text, where the phonemes of the target text include an adjacent first phoneme and second phoneme; performing feature extraction on the first phoneme and the second phoneme to obtain the first audio feature of the first phoneme and the second audio feature of the second phoneme; and obtaining, through the target recurrent neural network RNN according to the first audio feature and the second audio feature, the first voice data corresponding to the first phoneme and the second voice data corresponding to the second phoneme.
- In the above manner, the target RNN can process the first audio feature and the second audio feature in parallel; that is, the processing of the first audio feature and the processing of the second audio feature are decoupled, which reduces the time the target RNN takes to process audio features.
- The TTS model here includes the target RNN and the network used for feature extraction.
- In general, speech synthesis technology (that is, the technology that obtains the corresponding audio based on the target text) can be divided into two types: cloud-engine-based speech synthesis (referred to as "online speech synthesis") and local-engine-based speech synthesis (referred to as "offline speech synthesis"). Online speech synthesis has the advantages of high naturalness and high real-time performance and does not occupy client device resources, but its shortcomings are also obvious: although the application (APP) using speech synthesis can send a large piece of text to the server at one time, the voice data synthesized by the server is sent back in segments to the client where the APP is installed, and the amount of voice data is relatively large even after compression.
- Offline synthesis, by contrast, does not depend on the network, which ensures the stability of the synthesis service and protects user privacy.
- Offline synthesis places higher requirements on the model itself: the model must run fast enough to operate in real time on terminal devices (such as mobile phones, speakers, large screens and other IoT devices), and the model and its software package should occupy little storage space (for example, less than 30 MB) so as not to significantly increase the burden on the terminal device.
- In addition, the sound quality of the offline synthesis model should be close to that of the cloud-side TTS, so that it does not bring an obvious degradation of user experience.
- To meet the above requirements, knowledge distillation can be used.
- the target RNN is obtained by performing knowledge distillation on the student RNN according to the teacher RNN.
- Specifically, full model training can be performed first; that is, a teacher TTS model with high data processing accuracy (including the teacher RNN and the teacher feature extraction network) is trained, and then knowledge distillation training is performed on the student TTS model (including the student RNN and the student feature extraction network) based on the teacher TTS model, to obtain the compressed TTS model (including the target RNN and the target feature extraction network) of the embodiment of the present application.
- The training loss used in the knowledge distillation includes but is not limited to the following three forms:
- the target RNN is obtained by performing knowledge distillation on the student RNN according to the teacher RNN and the first target loss; the first target loss indicates the difference between the first output and the second output ; wherein, the first output is the output of the output layer of the teacher RNN, and the second output is the output of the output layer of the student RNN.
- the loss can be constructed based on the speech data output by the output layer of the RNN (for example, the Mel spectrum or the BARK spectrum).
- This approach may be referred to as Mel-spectrogram-based distillation (MSD).
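A minimal sketch of such an output-level distillation loss is shown below; it assumes both RNNs produce mel-spectrograms of the same shape and uses mean squared error as an illustrative distance, which is not necessarily the exact loss of this application.

```python
import torch
import torch.nn.functional as F

def mel_distillation_loss(student_mel: torch.Tensor,
                          teacher_mel: torch.Tensor) -> torch.Tensor:
    """Mel-spectrogram-based distillation: penalize the difference between
    the speech data (mel or Bark spectrum) produced by the output layers
    of the student RNN and the teacher RNN."""
    return F.mse_loss(student_mel, teacher_mel)

student_mel = torch.randn(1, 80, 120, requires_grad=True)  # [batch, n_mels, frames]
teacher_mel = torch.randn(1, 80, 120)                      # teacher output, no grad needed
loss = mel_distillation_loss(student_mel, teacher_mel)
loss.backward()   # gradients flow only into the student
```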
- the target RNN is obtained by performing knowledge distillation on the student RNN according to the teacher RNN and the first target loss; the first target loss indicates the difference between the first output and the second output ; wherein, the first output is the output of the middle layer of the teacher RNN, and the second output is the output of the middle layer of the student RNN.
- In one possible implementation, the first phoneme and the second phoneme may be processed through a target feature extraction network to obtain the first audio feature of the first phoneme and the second audio feature of the second phoneme; the target feature extraction network is obtained by performing knowledge distillation on the student feature extraction network according to the teacher feature extraction network and a second target loss, where the second target loss indicates the difference between a third output and a fourth output.
- the third output is the output of the middle layer of the teacher feature extraction network
- the fourth output is the output of the middle layer of the student feature extraction network.
- Specifically, an intermediate representation distillation (IRD) method based on intermediate feature representations may be used, and the loss used in knowledge distillation can take a form such as:
  $L_{IRD}=\sum_{i}\left\|K_{S}^{i}W_{i}-K_{T}^{i}\right\|_{2}^{2}$
- where $K_{S}^{i}$ and $K_{T}^{i}$ are the outputs of the i-th middle layer of the student TTS model and the teacher TTS model respectively (which can be a middle layer of the RNN or a middle layer of the feature extraction network), and $W_{i}$ is a parameter to be learned for the i-th layer of the student TTS model, used to force the output of each layer of the student TTS model to be close enough to the output of the corresponding layer of the teacher TTS model.
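The following sketch shows one way such an intermediate-representation loss could be computed; the per-layer linear projections W_i, the layer pairing and the squared-error form are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class IRDLoss(nn.Module):
    """Intermediate representation distillation: align each selected hidden
    layer of the student with the corresponding teacher layer through a
    learnable matrix W_i, and penalize the remaining difference."""
    def __init__(self, student_dims, teacher_dims):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Linear(s, t, bias=False) for s, t in zip(student_dims, teacher_dims)]
        )

    def forward(self, student_feats, teacher_feats) -> torch.Tensor:
        loss = 0.0
        for W_i, k_s, k_t in zip(self.proj, student_feats, teacher_feats):
            # k_s: [batch, frames, d_student], k_t: [batch, frames, d_teacher]
            loss = loss + torch.mean((W_i(k_s) - k_t) ** 2)
        return loss

ird = IRDLoss(student_dims=[256, 256], teacher_dims=[512, 512])
student_feats = [torch.randn(1, 120, 256) for _ in range(2)]
teacher_feats = [torch.randn(1, 120, 512) for _ in range(2)]
print(ird(student_feats, teacher_feats))
```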
- In one possible implementation, the first phoneme and the second phoneme may be processed through a target feature extraction network to obtain the first audio feature of the first phoneme and the second audio feature of the second phoneme; the target feature extraction network is obtained by performing knowledge distillation on the student feature extraction network according to the teacher feature extraction network and the second target loss, where the second target loss indicates the difference between the third output and the fourth output.
- the third output is the output of the output layer of the teacher feature extraction network, and the fourth output is the output of the output layer of the student feature extraction network.
- a prosody-based distillation method may be used to force the student TTS model to learn the prosody prediction result of the teacher's TTS model.
- Specifically, the loss function used in knowledge distillation can take a form such as:
  $L_{PD}=\left\|d_{S}-d_{T}\right\|_{2}+\left\|p_{S}-p_{T}\right\|_{2}+\left\|e_{S}-e_{T}\right\|_{2}+\left\|\theta_{S}^{f}W_{f}-\theta_{T}^{f}\right\|_{2}+\left\|\theta_{S}^{e}W_{e}-\theta_{T}^{e}\right\|_{2}$
- where the first three terms are the second-order norms between the duration, pitch and energy predicted by the student TTS model and the teacher TTS model respectively, $\theta^{f}$ and $\theta^{e}$ denote the weights of the last convolutional layer of the pitch prediction module and the energy prediction module, and $W_{f}$ and $W_{e}$ denote the trainable matrices used to align the parameter dimensions.
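A sketch of a prosody distillation loss with this structure is shown below; the dictionary keys, tensor shapes and the plain sum of L2 terms are illustrative assumptions rather than the exact formulation.

```python
import torch

def prosody_distillation_loss(student: dict, teacher: dict,
                              W_f: torch.Tensor, W_e: torch.Tensor) -> torch.Tensor:
    """Force the student to reproduce the teacher's prosody predictions
    (duration d, pitch p, energy e) and align the last convolutional
    weights of the pitch/energy predictors via trainable matrices."""
    loss = (torch.norm(student["d"] - teacher["d"], p=2)
            + torch.norm(student["p"] - teacher["p"], p=2)
            + torch.norm(student["e"] - teacher["e"], p=2))
    # Weight-alignment terms: project the student's conv weights into the
    # teacher's parameter dimension before comparing.
    loss = loss + torch.norm(student["w_pitch"] @ W_f - teacher["w_pitch"], p=2)
    loss = loss + torch.norm(student["w_energy"] @ W_e - teacher["w_energy"], p=2)
    return loss

student = {"d": torch.randn(10), "p": torch.randn(120), "e": torch.randn(120),
           "w_pitch": torch.randn(9, 256), "w_energy": torch.randn(9, 256)}
teacher = {"d": torch.randn(10), "p": torch.randn(120), "e": torch.randn(120),
           "w_pitch": torch.randn(9, 512), "w_energy": torch.randn(9, 512)}
W_f = torch.randn(256, 512, requires_grad=True)
W_e = torch.randn(256, 512, requires_grad=True)
print(prosody_distillation_loss(student, teacher, W_f, W_e))
```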
- In other words, the embodiment of the present application proposes a model distillation method that first trains a full teacher TTS model, then designs a student TTS model with a smaller model size, and trains the student TTS model using a variety of distillation methods, including but not limited to the Mel-spectrogram-based, intermediate-representation-based and prosody-based distillation described above.
- this embodiment of the present application may include a text acquisition and processing module, which may be used to acquire target text to be processed.
- the target text is preprocessed.
- the preprocessing may include text analysis.
- the text analysis may be syntactic analysis to obtain text features.
- the text features may include but are not limited to: phoneme sequence, part of speech, word length, and prosodic pause. For details, reference may be made to the description of step 401 in the foregoing embodiment, which is not repeated here.
- the encoding module can be used to encode the processed text data to obtain the feature vector representation.
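As an illustration of one possible encoding module, the sketch below uses an embedding (look-up) layer followed by convolutional layers and a bidirectional recurrent layer; the specific sizes and the phoneme vocabulary are assumptions for illustration, not the exact encoder of this application.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Phoneme encoder sketch: look-up (embedding) layer, 3 conv layers
    (512 filters, kernel size 5), and 1 bidirectional recurrent layer."""
    def __init__(self, n_phonemes: int = 100, dim: int = 512):
        super().__init__()
        self.lup = nn.Embedding(n_phonemes, dim)
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(dim, dim, kernel_size=5, padding=2),
                          nn.BatchNorm1d(dim), nn.ReLU())
            for _ in range(3)
        ])
        # Bidirectional layer: 256 units per direction -> 512-dim output.
        self.rnn = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # phoneme_ids: [batch, n_phonemes_in_text] integer ID sequence
        x = self.lup(phoneme_ids)                 # [B, T, 512]
        x = self.convs(x.transpose(1, 2))         # [B, 512, T]
        h, _ = self.rnn(x.transpose(1, 2))        # [B, T, 512] feature vectors H
        return h

ids = torch.randint(0, 100, (1, 7))   # ID sequence of the target text's phonemes
H = Encoder()(ids)
print(H.shape)  # torch.Size([1, 7, 512])
```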
- the prosody prediction module can be used to predict duration, pitch and energy, wherein the prosody prediction module can include a duration prediction module, a pitch prediction module and an energy prediction module, and the duration prediction module can be used to make duration prediction according to the feature vector output by the encoding,
- the pitch prediction module can be used to make pitch prediction according to the feature vector output by the encoding, and the energy prediction module can be used to make energy prediction according to the feature vector output by the encoding.
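The sketch below illustrates the three predictors sharing a common structure and the subsequent expansion of phoneme-level features to frame level; the layer sizes, the log-duration convention and the repeat-based upsampling are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Shared structure for the duration / pitch / energy predictors:
    a small conv stack followed by a linear layer that outputs one
    scalar per phoneme (hypothetical sizes)."""
    def __init__(self, dim: int = 384, kernel: int = 3):
        super().__init__()
        pad = kernel // 2
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel, padding=pad), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel, padding=pad), nn.ReLU(),
        )
        self.linear = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [n_phonemes, dim] -> [n_phonemes], one value per phoneme
        x = self.conv(h.t().unsqueeze(0)).squeeze(0).t()
        return self.linear(x).squeeze(-1)

h = torch.randn(7, 384)                     # encoder output H, one vector per phoneme
duration = VariancePredictor()(h)           # predicted (log-)durations D
pitch = VariancePredictor()(h)              # predicted pitch F
energy = VariancePredictor()(h)             # predicted energy E

# Length regulation: repeat each phoneme's vector according to its predicted
# duration in frames, so that downstream processing works frame by frame.
# The exp/round/clamp steps are only a demo convention for untrained weights.
frames = torch.clamp(torch.round(torch.exp(duration)), min=1, max=20).long()
h_frames = torch.repeat_interleave(h, frames, dim=0)   # [total_frames, 384]
print(h_frames.shape)
```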
- the encoding module may output audio features. For details, reference may be made to the description of step 402 in the foregoing embodiment, which is not repeated here.
- the autoregressive module can superimpose the three outputs of the duration module, the pitch module, and the energy module, and output the corresponding spectrogram features (referred to as voice data in the above embodiment) in an autoregressive manner.
- For details, reference may be made to the description of step 403 in the foregoing embodiment, which is not repeated here.
- the vocoder module can convert the output of the autoregressive module into a sound waveform (referred to as audio in the above embodiment). For details, reference may be made to the description of step 404 in the foregoing embodiment, which is not repeated here.
- Specifically, the encoding module can encode the phoneme sequence (phoneme sequence X) of the input target text to obtain a hidden layer representation (hidden representations H), and then perform prosody prediction on the hidden layer representation (prosody injector), where the prosody prediction includes a duration prediction module, a pitch prediction module and an energy prediction module, and the output of the prosody prediction is an audio feature (Sum G), where the audio feature includes the audio feature of each frame (g1, ..., gn, ..., gN); then an autoregressive module (such as the serially connected LUP layer shown in Figure 8 (its dimension is 512), 3 convolutional layers of 512 filters with a kernel size of 5, and 1 bidirectional recurrent neural network layer with 512 hidden units) can process the audio features to obtain voice data (Y1, ..., Yn, ..., YN); after the voice data is compensated, the processed voice data (for example, the MEL spectrum shown in Figure 8) can be obtained.
- In a training scenario, a teacher TTS model (such as the teacher SAR acoustic model in Figure 9) can be obtained by training based on the TTS training corpus, and knowledge distillation can then be performed on the teacher TTS model to obtain the target TTS model (such as the small SAR acoustic model in Figure 9). Based on the target TTS model obtained by training, online speech synthesis can be performed. Specifically, the input text (that is, the target text in the above embodiment) can be acquired, front-end processing is performed on the acquired target text to obtain text features, the text features are processed based on the target TTS model (depicted as acoustic feature decoding in Figure 9) to obtain voice data (described as acoustic features in Figure 9), and audio synthesis is performed based on the acoustic features.
- FIG. 10 is a schematic diagram of a text processing apparatus 1000 provided by the embodiment of the present application.
- a text processing apparatus 1000 provided by an embodiment of the application includes:
- an acquisition module 1001 configured to acquire a target text, the phonemes of the target text include adjacent first phonemes and second phonemes;
- For the specific description of the obtaining module 1001, reference may be made to the description of step 401, which will not be repeated here.
- a feature extraction module 1002 configured to perform feature extraction on the first phoneme and the second phoneme to obtain a first audio feature of the first phoneme and a second audio feature of the second phoneme;
- For the specific description of the feature extraction module 1002, reference may be made to the description of step 402, which will not be repeated here.
- a voice data extraction module 1003, configured to obtain, through the target recurrent neural network RNN, the first voice data corresponding to the first phoneme according to the first audio feature, and obtain, through the target RNN, the second voice data corresponding to the second phoneme according to the second audio feature, where the step of obtaining the first voice data and the step of obtaining the second voice data are performed in parallel;
- An audio extraction module 1004 configured to acquire audio corresponding to the first phoneme and the second phoneme through a vocoder according to the first voice data and the second voice data.
- For the specific description of the audio extraction module 1004, reference may be made to the description of step 404, which will not be repeated here.
- the target RNN includes a hidden layer and an output layer
- the voice data extraction module is configured to: determine the output of the first hidden layer according to the first audio feature through the hidden layer; determine the first voice data through the output layer according to the output of the first hidden layer; determine the output of the second hidden layer according to the second audio feature through the hidden layer; and
- determine the second voice data through the output layer according to the output of the second hidden layer, where, in the process of determining the output of the second hidden layer by the hidden layer, the output of the first hidden layer does not act as an input to the hidden layer.
- the duration of the first phoneme is N frames
- the number of the first audio features is N
- each audio feature in the N first audio features corresponds to one of the N frames; the N first audio features include a first target audio feature and a third target audio feature;
- the frame corresponding to the first target audio feature is the adjacent frame before the frame corresponding to the third target audio feature.
- the first voice data includes the first sub-voice data corresponding to the first target audio feature and the third sub-voice data corresponding to the third target audio feature;
- the voice data extraction module is configured to: determine the output of the third sub-hidden layer according to the third target audio feature through the hidden layer; and determine the output of the first sub-hidden layer according to the first target audio feature and the output of the third sub-hidden layer through the hidden layer;
- the output layer determines the third sub-speech data according to the output of the third sub-hidden layer
- the output layer determines the first sub-speech data according to the output of the first sub-hidden layer.
- the first audio feature includes at least one of the following information: fundamental frequency information or energy information of the first phoneme
- the second audio feature includes at least one of the following information: fundamental frequency information or energy information of the second phoneme.
- the first voice data and the second voice data are Mel spectrum MEL or Bark spectrum.
- the target RNN is obtained by performing knowledge distillation on the student RNN according to the teacher RNN.
- the target RNN is obtained by performing knowledge distillation on the student RNN according to the teacher RNN and the first target loss; the first target loss indicates the difference between the first output and the second output; wherein,
- the first output is the output of the output layer of the teacher RNN
- the second output is the output of the output layer of the student RNN
- the first output is the output of the middle layer of the teacher RNN
- the second output is the output of the middle layer of the student RNN.
- the feature extraction module is configured to process the first phoneme and the second phoneme through a target feature extraction network to obtain the first audio feature of the first phoneme, and The second audio feature of the second phoneme;
- the target feature extraction network is obtained by performing knowledge distillation on the student feature extraction network according to the teacher feature extraction network and the second target loss;
- the second target loss indicates the difference between the third output and the fourth output;
- the third output is the output of the output layer of the teacher feature extraction network
- the fourth output is the output of the output layer of the student feature extraction network
- the third output is the output of the middle layer of the teacher feature extraction network
- the fourth output is the output of the middle layer of the student feature extraction network.
- FIG. 11 is a schematic structural diagram of the execution device provided by the embodiment of the present application. The execution device 1100 may specifically be, for example, a smart wearable device, a server, or the like, which is not limited here.
- the data processing apparatus described in the embodiment corresponding to FIG. 11 may be deployed on the execution device 1100 to implement the data processing function in the embodiment corresponding to FIG. 11 .
- the execution device 1100 includes: a receiver 1101, a transmitter 1102, a processor 1103, and a memory 1104 (wherein the number of processors 1103 in the execution device 1100 may be one or more, and one processor is taken as an example in FIG. 11 ) , wherein the processor 1103 may include an application processor 11031 and a communication processor 11032 .
- the receiver 1101, the transmitter 1102, the processor 1103, and the memory 1104 may be connected by a bus or otherwise.
- Memory 1104 may include read-only memory and random access memory, and provides instructions and data to processor 1103 . A portion of memory 1104 may also include non-volatile random access memory (NVRAM).
- the memory 1104 stores processors and operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, wherein the operating instructions may include various operating instructions for implementing various operations.
- the processor 1103 controls the operation of the execution device.
- various components of the execution device are coupled together through a bus system, where the bus system may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus.
- the various buses are referred to as bus systems in the figures.
- the methods disclosed in the above embodiments of the present application may be applied to the processor 1103 or implemented by the processor 1103 .
- the processor 1103 may be an integrated circuit chip with signal processing capability. In the implementation process, each step of the above-mentioned method can be completed by an integrated logic circuit of hardware in the processor 1103 or an instruction in the form of software.
- the above-mentioned processor 1103 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
- the processor 1103 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application.
- a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
- the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
- the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
- the storage medium is located in the memory 1104, and the processor 1103 reads the information in the memory 1104, and completes the steps of the above method in combination with its hardware.
- the receiver 1101 can be used to receive input numerical or character information, and generate signal input related to the relevant settings and function control of the execution device.
- the transmitter 1102 can be used to output digital or character information through the first interface; the transmitter 1102 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 1102 can also include a display device such as a display screen .
- the processor 1103 is configured to execute the text data processing method in the embodiment corresponding to FIG. 4 .
- FIG. 12 is a schematic structural diagram of the training device provided by the embodiment of the present application.
- the training device 1200 is implemented by one or more servers. The training device 1200 may vary widely by configuration or performance, and may include one or more central processing units (CPUs) 1212 (for example, one or more processors), a memory 1232, and one or more storage media 1230 (for example, one or more mass storage devices) storing application programs 1242 or data 1244.
- the memory 1232 and the storage medium 1230 may be short-term storage or persistent storage.
- the program stored in the storage medium 1230 may include one or more modules (not shown in the figure), and each module may include a series of instructions to operate on the training device. Further, the central processing unit 1212 may be configured to communicate with the storage medium 1230 to execute a series of instruction operations in the storage medium 1230 on the training device 1200 .
- the training device 1200 may also include one or more power supplies 1226, one or more wired or wireless network interfaces 1250, one or more input and output interfaces 1258; or, one or more operating systems 1241, such as Windows ServerTM, Mac OS XTM , UnixTM, LinuxTM, FreeBSDTM and so on.
- the training device may perform the steps related to model training in the foregoing embodiments.
- Embodiments of the present application also provide a computer program product that, when running on a computer, causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
- Embodiments of the present application further provide a computer-readable storage medium, where a program for signal processing is stored in the computer-readable storage medium, and when the program runs on a computer, it causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
- the execution device, training device, or terminal device provided in this embodiment of the present application may specifically be a chip, and the chip includes: a processing unit and a communication unit, the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, pins or circuits, etc.
- the processing unit can execute the computer executable instructions stored in the storage unit, so that the chip in the execution device executes the data processing method described in the above embodiments, or the chip in the training device executes the data processing method described in the above embodiment.
- the storage unit is a storage unit in the chip, such as a register, a cache, etc.
- the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
- FIG. 13 is a schematic structural diagram of a chip provided by an embodiment of the application.
- the chip may be represented as a neural network processor NPU 1300, and the NPU 1300 is mounted as a co-processor to the main CPU (Host CPU), tasks are allocated by the Host CPU.
- the core part of the NPU is the arithmetic circuit 1303, which is controlled by the controller 1304 to extract the matrix data in the memory and perform multiplication operations.
- the arithmetic circuit 1303 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 1303 is a two-dimensional systolic array. The arithmetic circuit 1303 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 1303 is a general-purpose matrix processor.
- the operation circuit fetches the data corresponding to the matrix B from the weight memory 1302 and buffers it on each PE in the operation circuit.
- the arithmetic circuit fetches the data of matrix A and matrix B from the input memory 1301 to perform matrix operation, and stores the partial result or final result of the matrix in the accumulator 1308 .
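As a software analogy of this data path (not a description of the actual hardware), the sketch below caches the weight matrix B, streams tiles of matrix A, and accumulates partial products before producing the final result; the tile size is an arbitrary demo value.

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 4) -> np.ndarray:
    """Functional analogy of the NPU flow: B (weights) is cached once,
    A (input) is streamed tile by tile along the shared dimension, and
    partial results are summed in an accumulator."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    accumulator = np.zeros((n, m), dtype=A.dtype)
    for start in range(0, k, tile):                # stream A along the K dimension
        a_tile = A[:, start:start + tile]
        b_tile = B[start:start + tile, :]          # corresponding cached slice of B
        accumulator += a_tile @ b_tile             # partial result accumulated
    return accumulator

A = np.random.rand(8, 16).astype(np.float32)
B = np.random.rand(16, 8).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-5)
```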
- Unified memory 1306 is used to store input data and output data.
- the weight data is directly transferred to the weight memory 1302 through the direct memory access controller (DMAC) 1305.
- Input data is also moved to unified memory 1306 via the DMAC.
- the BIU is the Bus Interface Unit, that is, the bus interface unit 1310, which is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer (Instruction Fetch Buffer, IFB) 1309.
- the bus interface unit 1310 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 1309 to obtain instructions from the external memory, and also for the storage unit access controller 1305 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
- the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1306 , the weight data to the weight memory 1302 , or the input data to the input memory 1301 .
- the vector calculation unit 1307 includes a plurality of operation processing units and, if necessary, further processes the output of the operation circuit 1303, for example, performing vector multiplication, vector addition, exponential operations, logarithmic operations, magnitude comparison, and the like. It is mainly used for non-convolutional/fully connected layer network computation in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
- vector computation unit 1307 can store the processed output vectors to unified memory 1306 .
- In some implementations, the vector calculation unit 1307 can apply a linear function or a nonlinear function to the output of the operation circuit 1303, for example, performing linear interpolation on the feature planes extracted by the convolutional layers, or applying a nonlinear function to a vector of accumulated values to generate activation values.
- the vector computation unit 1307 generates normalized values, pixel-level summed values, or both.
- the vector of processed outputs can be used as an activation input to the arithmetic circuit 1303, such as for use in subsequent layers in a neural network.
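The following sketch is a software analogy of these vector-unit steps (assumed shapes and operations): it applies bias addition, a nonlinear activation and a normalization to the matrix-unit output and feeds the result to the next layer as its activation input.

```python
import numpy as np

def vector_unit(matmul_out: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Post-processing typically handled by the vector unit: bias addition,
    nonlinear activation and (here) a simple normalization."""
    x = matmul_out + bias                 # vector addition
    x = np.maximum(x, 0.0)                # nonlinear activation (ReLU)
    x = x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)  # normalization
    return x

layer1_out = np.random.rand(8, 8).astype(np.float32)   # output of the matrix unit
activations = vector_unit(layer1_out, bias=np.zeros(8, dtype=np.float32))
# The processed vector serves as the activation input of the next layer.
next_weights = np.random.rand(8, 4).astype(np.float32)
layer2_out = activations @ next_weights
```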
- the instruction fetch memory (instruction fetch buffer) 1309 connected to the controller 1304 is used to store the instructions used by the controller 1304;
- the unified memory 1306, the input memory 1301, the weight memory 1302 and the instruction fetch memory 1309 are all On-Chip memories. External memory is private to the NPU hardware architecture.
- the processor mentioned in any one of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above program.
- the device embodiments described above are only schematic, where the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
- the computer program product includes one or more computer instructions.
- the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired manner (for example, coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, microwave).
- the computer-readable storage medium may be any usable medium that a computer can store, or a data storage device, such as a training device or a data center, integrating one or more usable media.
- the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVD), or semiconductor media (eg, Solid State Disk (SSD)), and the like.
Abstract
一种文本数据处理方法,应用于人工智能领域,包括:获取目标文本,目标文本的音素包括相邻的第一音素和第二音素(401);对第一音素和第二音素进行特征提取,以获取第一音素的第一音频特征、以及第二音素的第二音频特征(402);通过目标循环神经网络RNN根据第一音频特征获取第一音素对应的第一语音数据,通过目标RNN根据第二音频特征获取第二音素对应的第二语音数据;获取第一音素对应的第一语音数据和获取第二音素对应的第二语音数据的步骤并行执行(403);根据第一语音数据和第二语音数据,通过声码器获取第一音素和第二音素对应的音频(404)。目标RNN可以并行处理第一音频特征和第二音频特征,即实现了第一音频特征和第二音频特征的处理过程的解耦,减少了目标RNN处理音频特征的时长。
Description
本申请要求于2021年1月22日提交中国专利局、申请号为202110091046.9、发明名称为“一种文本数据处理方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
本申请涉及人工智能领域,尤其涉及一种文本数据处理方法及装置。
人工智能(artificial intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
随着多媒体通信技术的不断发展,作为人机通信重要方式之一的语音合成以其方便、快捷的优点收到研究者的广泛关注,其中,文本转语音(text to speech,TTS)可以将文本转换为对应的音频。随着深度学习近年来得到了长足的发展,文本转语音技术已经从基于简单统计模型(如hidden markov model,HMM)等统计模型的参数化语音合成逐渐转入到基于深度神经网络模型的端到端音频合成。
在现有的实现中,通过循环神经网络(recurrent neural network,RNN)实现文本的音素内以及音素间的音频特征自回归处理,所谓自回归处理是指基于RNN处理前一帧的音频特征得到的隐含层输出来预测当前帧的语音数据,然而,由于自回归的迭代输出特性,使得RNN的音频合成速度较慢。
发明内容
第一方面,本申请提供了一种文本数据处理方法,包括:
获取目标文本,所述目标文本的音素包括相邻的第一音素和第二音素;
音素(phone)也可以称之为发声音素,是根据语音的自然属性划分出来的最小语音单位,依据音节里的发音动作来分析,一个动作构成一个音素。音素分为元音与辅音两大类。例如,汉语音节a(例如,一声:啊)只有一个音素,ai(例如四声:爱)有两个音素,dai(例如一声:呆)有三个音素等;
在获取到目标文本之后,可以对目标文本进行预处理,将目标文本处理为适配于TTS模型输入格式的序列,示例性的,服务器可以对目标文本进行文本归一化,将不规范的目标文本转为可发音格式,并进行分词处理,按词语为单位分割目标文本中的句子,来解决句子歧义性,并进行韵律分析,预测目标文本中各个句子的停顿节奏和/或重音等,并将目标文本的字转换为音素级别,以得到音素串(也就是目标文本的因素),并将音素串转换为TTS 模型需要的序列格式(后续实施例可以称之为ID序列);
目标文本的音素可以包括相邻的第一音素和第二音素。目标文本的音素是由多个音素按照特定的顺序进行排列的因素序列,第一音素和第二音素可以是上述因素序列中任意相邻的两个音素。
对所述第一音素和所述第二音素进行特征提取,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征;
可以利用编码器(例如卷积神经网络(convolutional neural networks,CNN)、循环神经网络(recurrent neural network,RNN)、transformer等网络结构或混合网络结构)对目标文本的音素进行特征提取。例如,编码器的串行结构可以但不限于包括LUP层(其维度为512)、3层512个卷积核kernel大小为5的filter及1层隐含层为512的双向循环神经网络层。可以利用编码器将目标文本的音素转化为隐含层表示序列(也可以称之为特征向量),即将目标文本的音素映射成中间隐式表示H,对于每个音素将生成一个特征向量,这些特征向量包含丰富的音素上下环境信息。为了能够得到可以包括更丰富的特征的音频特征,在利用编码器将目标文本的音素转化为特征向量之后,可以对编码器得到的特征向量进行韵律预测,以得到音频特征。
通过目标循环神经网络RNN根据所述第一音频特征获取所述第一音素对应的第一语音数据,通过所述目标RNN根据所述第二音频特征获取所述第二音素对应的第二语音数据;其中,所述获取所述第一音素对应的第一语音数据和所述获取所述第二音素对应的第二语音数据的步骤并行执行;
其中,所谓并行执行,是指在通过目标RNN根据第一音频特征计算第一语音数据的过程中,也在进行着通过目标RNN根据第二音频特征计算第二语音数据的过程。
在现有的实现中,在得到计算第一语音数据过程中隐含层的输出之后,目标RNN开始第二语音数据的计算,以第一目标音频特征为第一音频特征中最后一帧的音频特征、第二目标音频特征为第二音频特征中第一帧的音频特征为例,在一种情况下,目标RNN计算语音数据的过程可以包括:
隐含层开始处理第一目标音频特征,隐含层计算得到第一子隐含层输出,输出层开始处理第一子隐含层输出,输出层计算得到语音数据,隐含层开始处理第二目标音频特征,隐含层计算得到第二隐含层输出,输出层开始处理第二隐含层输出,输出层计算得到语音数据;其中,隐含层处理第二目标音频特征的开始时间可以在输出层开始处理第一子隐含层输出之后,输出层计算得到语音数据之前,也就是说,也可能出现目标RNN计算第一语音数据的时间与计算第一语音数据的时间重叠,本申请中,上述情况的时间重叠并不认为是目标循环神经网络RNN在并行确定所述第一音素对应的第一语音数据以及所述第二音素对应的第二语音数据。
本申请中,目标RNN包括隐含层以及输出层,且第一音频特征和第二音频特征可以为多帧的音频特征,以第一目标音频特征为第一音频特征中最后一帧的音频特征、第二目标音频特征为第二音频特征中第一帧的音频特征为例,目标RNN计算语音数据的过程可以包括:
隐含层开始处理第一目标音频特征,隐含层计算得到第一子隐含层输出,输出层开始处理第一子隐含层输出,输出层计算得到语音数据;
隐含层开始处理第二目标音频特征,隐含层计算得到第二隐含层输出,输出层开始处理第二隐含层输出,输出层计算得到语音数据;
本实施例中,所谓并行,是指目标RNN的隐含层开始处理第二目标音频特征的时间在隐含层计算得到第一子隐含层输出之前,换一种表述方式,目标RNN的隐含层什么时候开始处理第二目标音频特征,并不依赖于隐含层完成第一子隐含层输出的计算,而依赖于第二目标音频特征的获取时间,在获取到第二目标音频特征之后,目标RNN的隐含层就可以直接开始处理第二目标音频特征;
应理解,除了目标RNN的隐含层开始处理第二目标音频特征的时间在隐含层计算得到第一子隐含层输出之前,目标RNN处理第二音频特征和处理第一音频特征的处理时间还需要存在一定的重叠,以免出现目标RNN的隐含层过早的开始处理第二目标音频特征,目标RNN在处理完第二音频特征之后,才开始处理第一音频特征的情况。
在现有的实现中,在通过RNN处理音频特征以获取语音数据的过程中,针对于不同音素之间的相邻帧,RNN的隐含层的输入不仅包括输入层处理当前帧的音频特征的输出,还包括隐含层处理上一帧的音频特征的输出。因此,RNN在处理不同音素之间的两个相邻帧中靠后的一帧的音频特征时,需要等待隐含层处理上一帧的音频特征并得到隐含层输出之后,才可以进行当前帧的音频特征处理;也就是说,目标RNN计算第二语音数据的输入不仅包括第二音频特征,还包括计算第一语音数据过程中隐含层的输出,也就是说,只有在得到计算第一语音数据过程中隐含层的输出,目标RNN才可以开始第二语音数据的计算,使得目标RNN处理音频特征的时间较长,本实施例中,目标RNN并行处理第一音频特征和第二音频特征,即实现了第一音频特征和第二音频特征的处理过程的解耦,减少了目标RNN处理音频特征的时长。
根据所述第一语音数据和所述第二语音数据,通过声码器获取所述第一音素和所述第二音素对应的音频。
本申请实施例中,由于RNN在处理不同音素之间的两个相邻帧中靠后的一帧的音频特征时,隐含层的输入不包括隐含层处理上一帧的音频特征的输出,进而,RNN在处理不同音素之间的两个相邻帧中靠后的一帧的音频特征时,不需要等待隐含层处理上一帧的音频特征并得到隐含层输出之后,就可以进行当前帧的音频特征处理。也就是说,隐含层可以用于在确定出所述第一子隐含层输出之前,就根据所述第二音频特征确定所述第二子隐含层输出,从而进一步降低了RNN处理音频特征过程的时间开销。
在一种可能的实现中,所述目标RNN包括隐含层和输出层,所述通过目标循环神经网络RNN根据所述第一音频特征获取所述第一音素对应的第一语音数据,通过所述目标RNN根据所述第二音频特征获取所述第二音素对应的第二语音数据,包括:
通过所述隐含层根据所述第一音频特征确定第一隐含层输出;
通过所述输出层根据所述第一隐含层输出确定所述第一语音数据;
通过所述隐含层根据所述第二音频特征确定第二隐含层输出;
通过所述输出层根据所述第二隐含层输出确定所述第二语音数据,其中,所述隐含层确定第二隐含层输出的过程中,所述第一隐含层输出不作为所述隐含层的输入。
在通过目标RNN处理第二音频特征的过程中，所述隐含层可以根据所述第二目标音频特征确定第二子隐含层输出，所述输出层可以根据所述第二子隐含层输出确定所述第二子语音数据，和现有的实现中，所述隐含层根据所述第二目标音频特征以及第一子隐含层输出来确定第二子隐含层输出不同的是，本实施例中，在所述隐含层确定所述第二子隐含层输出的过程中，所述第一子隐含层输出不作为所述隐含层的输入；第一目标音频特征$x_{t-1}$和第二目标音频特征$x_{t}$为不同音素的相邻帧的音频特征，RNN在处理第二目标音频特征$x_{t}$时，RNN的输入层U处理第二目标音频特征$x_{t}$得到的结果可以作为RNN的隐含层的输入，同时RNN的隐含层在处理输入层U处理第一目标音频特征$x_{t-1}$得到的结果后得到的隐含层输出$s_{t-1}$不作为RNN的隐含层的输入。相当于，RNN在处理不同音素之间的两个相邻帧中靠后的一帧的音频特征时，隐含层的输入不包括隐含层处理上一帧的音频特征的输出，也就是说，对于不同的音素单位，不同音素间不采用自回归方式执行，从而降低了RNN处理音频特征时所需的算力开销以及处理时间。
应理解,本申请实施例并不限定目标RNN在处理第二目标音频特征时,目标RNN的隐含层的输入仅包括RNN的输入层处理第二目标音频特征得到的结果。
在一种可能的实现中,
所述第一音素的时长为N帧,所述第一音频特征的数量为N,且N个第一音频特征中的每个音频特征对应于所述N帧中的一帧,所述N个第一音频特征包括第一目标音频特征和第三目标音频特征,所述第一目标音频特征对应的帧为所述第三目标音频特征对应的帧之前的帧;所述第一语音数据包括所述第一目标音频特征对应的第一子语音数据以及所述第三目标音频特征对应的第三子语音数据;
所述通过所述隐含层根据所述第一音频特征确定第一隐含层输出包括:通过所述隐含层根据所述第三目标音频特征确定第三子隐含层输出;
通过所述隐含层根据所述第一目标音频特征和所述第三子隐含层输出确定第一子隐含层输出;
所述通过所述输出层根据所述第一隐含层输出确定所述第一语音数据包括:
所述输出层根据所述第三子隐含层输出确定所述第三子语音数据,
所述输出层根据所述第一子隐含层输出确定所述第一子语音数据。
在RNN处理第三音频特征的过程中,所述隐含层可以根据所述第三目标音频特征确定第三子隐含层输出,具体的,所述隐含层可以根据RNN的输入层处理所述第三目标音频特征得到的输入层输出来确定第三子隐含层输出,所述输出层用于根据所述第三子隐含层输出确定所述第三子语音数据。其中,第三子语音数据可以为梅尔频谱MEL或巴克谱Bark。
在RNN处理第一音频特征的过程中,所述隐含层可以根据所述第一目标音频特征和所述第三子隐含层输出确定所述第一子隐含层输出,所述输出层可以根据所述第一隐含层输出确定所述第一子语音数据。相当于,针对于同一个音素的各个帧,RNN的隐含层的输入不仅包括输入层处理当前帧的音频特征的输出,还包括隐含层处理上一帧的音频特征的输 出,也就是说,对于每个音素单位,其内部采用自回归方式执行。
在一种可能的实现中,所述第一音频特征包括如下信息的至少一种:所述第一音素的基频信息或能量信息,所述第二音频特征包括如下信息的至少一种:所述第二音素的基频信息或能量信息。
在一种可能的实现中,所述第一语音数据以及所述第二语音数据为梅尔频谱MEL或巴克谱Bark。
在一种可能的实现中,所述目标RNN为根据老师RNN对学生RNN进行知识蒸馏得到的。
在一种可能的实现中,所述目标RNN为根据老师RNN以及第一目标损失,通过对学生RNN进行知识蒸馏得到的;所述第一目标损失指示第一输出和第二输出之间的差异;其中,
所述第一输出为所述老师RNN的输出层的输出,所述第二输出为所述学生RNN的输出层的输出;或,
所述第一输出为所述老师RNN的中间层的输出,所述第二输出为所述学生RNN的中间层的输出。
在一种可能的实现中,所述对所述第一音素和所述第二音素进行特征提取,包括:
通过目标特征提取网络对所述第一音素和所述第二音素进行处理,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征;所述目标特征提取网络为根据老师特征提取网络以及第二目标损失,通过对学生特征提取网络进行知识蒸馏得到的;所述第二目标损失指示第三输出和第四输出之间的差异;其中,
所述第三输出为所述老师特征提取网络的输出层的输出,所述第四输出为所述学生特征提取网络的输出层的输出;或,
所述第三输出为所述老师特征提取网络的中间层的输出,所述第四输出为所述学生特征提取网络的中间层的输出。
第二方面,本申请提供了一种文本数据处理装置,包括:
获取模块,用于获取目标文本,所述目标文本的音素包括相邻的第一音素和第二音素;
特征提取模块,用于对所述第一音素和所述第二音素进行特征提取,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征;
语音数据提取模块,用于通过目标循环神经网络RNN根据所述第一音频特征获取所述第一音素对应的第一语音数据,通过所述目标RNN根据所述第二音频特征获取所述第二音素对应的第二语音数据;其中,所述获取所述第一音素对应的第一语音数据和所述获取所述第二音素对应的第二语音数据的步骤并行执行;
音频提取模块,用于根据所述第一语音数据和所述第二语音数据,通过声码器获取所述第一音素和所述第二音素对应的音频。
在一种可能的实现中,所述目标RNN包括隐含层和输出层,所述语音数据提取模块,用于通过所述隐含层根据所述第一音频特征确定第一隐含层输出;
通过所述输出层根据所述第一隐含层输出确定所述第一语音数据;
通过所述隐含层根据所述第二音频特征确定第二隐含层输出;
通过所述输出层根据所述第二隐含层输出确定所述第二语音数据,其中,所述隐含层确定第二隐含层输出的过程中,所述第一隐含层输出不作为所述隐含层的输入。
在一种可能的实现中,所述第一音素的时长为N帧,所述第一音频特征的数量为N,且N个第一音频特征中的每个音频特征对应于所述N帧中的一帧,所述N个第一音频特征包括第一目标音频特征和第三目标音频特征,所述第一目标音频特征对应的帧为所述第三目标音频特征对应的帧之前相邻的帧;所述第一语音数据包括所述第一目标音频特征对应的第一子语音数据以及所述第三目标音频特征对应的第三子语音数据;
所述语音数据提取模块,用于
通过所述隐含层根据所述第三目标音频特征确定第三子隐含层输出;
通过所述隐含层根据所述第一目标音频特征和所述第三子隐含层输出确定第一子隐含层输出;
所述输出层根据所述第三子隐含层输出确定所述第三子语音数据,
所述输出层根据所述第一子隐含层输出确定所述第一子语音数据。
在一种可能的实现中,所述第一音频特征包括如下信息的至少一种:所述第一音素的基频信息或能量信息,所述第二音频特征包括如下信息的至少一种:所述第二音素的基频信息或能量信息。
在一种可能的实现中,所述第一语音数据以及所述第二语音数据为梅尔频谱MEL或巴克谱Bark。
在一种可能的实现中,所述目标RNN为根据老师RNN对学生RNN进行知识蒸馏得到的。
在一种可能的实现中,所述目标RNN为根据老师RNN以及第一目标损失,通过对学生RNN进行知识蒸馏得到的;所述第一目标损失指示第一输出和第二输出之间的差异;其中,
所述第一输出为所述老师RNN的输出层的输出,所述第二输出为所述学生RNN的输出层的输出;或,
所述第一输出为所述老师RNN的中间层的输出,所述第二输出为所述学生RNN的中间层的输出。
在一种可能的实现中,所述特征提取模块,用于通过目标特征提取网络对所述第一音素和所述第二音素进行处理,以获取所述第一音素的第一音频特征、以及所述第二音素的 第二音频特征;所述目标特征提取网络为根据老师特征提取网络以及第二目标损失,通过对学生特征提取网络进行知识蒸馏得到的;所述第二目标损失指示第三输出和第四输出之间的差异;其中,
所述第三输出为所述老师特征提取网络的输出层的输出,所述第四输出为所述学生特征提取网络的输出层的输出;或,
所述第三输出为所述老师特征提取网络的中间层的输出,所述第四输出为所述学生特征提取网络的中间层的输出。
第三方面,本申请提供了一种文本数据处理装置,可以包括处理器,处理器和存储器耦合,存储器存储有程序指令,当存储器存储的程序指令被处理器执行时实现上述第一方面所述的方法。对于处理器执行第一方面的各个可能实现方式中的步骤,具体均可以参阅第一方面,此处不再赘述。
第四方面,本申请提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,当其在计算机上运行时,使得计算机执行上述第一方面所述的方法。
第五方面,本申请提供了一种电路系统,所述电路系统包括处理电路,所述处理电路配置为执行上述第一方面所述的方法。
第六方面,本申请提供了一种计算机程序产品,包括代码,当代码在计算机上运行时,使得计算机执行上述第一方面所述的方法。
第七方面,本申请提供了一种芯片系统,该芯片系统包括处理器,用于实现上述方面中所涉及的功能,例如,发送或处理上述方法中所涉及的数据和/或信息。在一种可能的设计中,所述芯片系统还包括存储器,所述存储器,用于保存服务器或通信设备必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包括芯片和其他分立器件。
本申请实施例提供了一种文本数据处理方法,包括:获取目标文本,所述目标文本的音素包括相邻的第一音素和第二音素;对所述第一音素和所述第二音素进行特征提取,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征;通过目标循环神经网络RNN根据所述第一音频特征获取所述第一音素对应的第一语音数据,通过所述目标RNN根据所述第二音频特征获取所述第二音素对应的第二语音数据;其中,所述获取所述第一音素对应的第一语音数据和所述获取所述第二音素对应的第二语音数据的步骤并行执行;根据所述第一语音数据和所述第二语音数据,通过声码器获取所述第一音素和所述第二音素对应的音频。通过上述方式,目标RNN可以并行处理第一音频特征和第二音频特征,即实现了第一音频特征和第二音频特征的处理过程的解耦,减少了目标RNN处理音频特征的时长。
图1为人工智能主体框架的一种结构示意图;
图2为一种自然语言处理系统;
图3a为本申请实施例提供的一种服务器的示意图;
图3b为本申请实施例提供的一种电子设备的示意图;
图4为本申请实施例提供的一种文本数据处理方法的示意图;
图5为本申请实施例提供的一种文本数据处理方法的示意图;
图6为本申请实施例提供的一种文本数据处理方法的示意图;
图7为本申请实施例提供的一种文本处理方法的软件架构示意;
图8为本申请实施例提供的一种文本处理方法的软件架构示意;
图9为本申请实施例提供的一种文本处理方法的软件架构示意;
图10为本申请实施例提供的一种文本处理装置的示意;
图11为本申请实施例提供的执行设备的一种结构示意图;
图12是本申请实施例提供的训练设备一种结构示意图;
图13为本申请实施例提供的芯片的一种结构示意图。
下面结合本发明实施例中的附图对本发明实施例进行描述。本发明的实施方式部分使用的术语仅用于对本发明的具体实施例进行解释,而非旨在限定本发明。
下面结合附图,对本申请的实施例进行描述。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,以便包含一系列单元的过程、方法、系统、产品或设备不必限于那些单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。
应当理解,在本申请中,“至少一个(项)”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,用于描述关联对象的关联关系,表示可以存在三种关系,例如,“A和/或B”可以表示:只存在A,只存在B以及同时存在A和B三种情况,其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达,是指这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b或c中的至少一项(个),可以表示:a,b,c,“a和b”,“a和c”,“b和c”,或“a和b和c”,其中a,b,c可以是单个,也可以是多个。
首先对人工智能系统总体工作流程进行描述,请参见图1,图1示出的为人工智能主体框架的一种结构示意图,下面从“智能信息链”(水平轴)和“IT价值链”(垂直轴)两个维度对上述人工智能主题框架进行阐述。其中,“智能信息链”反映从数据的获取到处理的一列过程。举例来说,可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中,数据经历了“数据—信息—知识—智慧”的凝练过程。“IT价值链”从人智能的底层基础设施、信息(提供和处理技术实现)到系统的产业生态过程,反映人工智能为信息技术产业带来的价值。
(1)基础设施
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。通过传感器与外部沟通;计算能力由智能芯片(CPU、NPU、GPU、ASIC、FPGA等硬件加速芯片)提供;基础平台包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。举例来说,传感器和外部沟通获取数据,这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据
基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。
(5)智能产品及行业应用
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能终端、智能交通、智能医疗、自动驾驶、平安城市等。
图2示出了通信系统的一个示例性的结构示意图,如图2所示,该通信系统包括服务器200和电子设备100,可选地,该通信系统可以包括一个或多个服务器并且每个服务器的覆盖范围内可以包括一个或多个电子设备,本申请对此不做限定。可选地,该通信系统还可以包括网络控制器、交换设备等其他网络实体,本申请不限于此。图2中的双向箭头表示服务器与电子设备存在通信连接,即服务器和电子设备之间可以通过通信网络实现数据传输。
需要说明的是,上述通信网络可以是局域网,也可以是通过中继(relay)设备转接的广域网,或者包括局域网和广域网。当该通信网络为局域网时,示例性的,该通信网络可以是wifi热点网络、wifi P2P网络、蓝牙网络、zigbee网络或近场通信(near field communication,NFC)网络等近距离通信网络。当该通信网络为广域网时,示例性的,该通信网络可以是第三代移动通信技术(3rd-generation wireless telephone technology,3G)网络、第四代移动通信技术(the 4th generation mobile communication technology,4G)网络、第五代移动通信技术 (5th-generation mobile communication technology,5G)网络、未来演进的公共陆地移动网络(public land mobile network,PLMN)或因特网等,本申请对此不作限定。
其中,在一种实现中,电子设备可以获取到用户输入的目标文本,电子设备可以将目标文本发送至服务器侧,服务器可以根据目标文本生成该目标文本对应的音频,服务器可以将音频发送至电子设备。
其中,在另一种实现中,电子设备可以获取到用户输入的目标文本,并根据目标文本生成该目标文本对应的音频。
应理解,图2中仅为便于理解,示意性地示出了一个通信系统,但这不应对本申请构成任何限定,该通信系统中还可以包括更多数量的服务器,也可以包括更多数量的电子设备,与不同的电子设备通信的服务器可以是相同的服务器,也可以是不同的服务器,与不同的电子设备通信的服务器的数量可以相同,也可以不同,本申请对此不做限定。还应理解,该通信系统中的服务器可以是任意一种具有收发功能的设备或可设置于该设备的芯片。图3a示出了服务器200的一个示例性的结构示意图,服务器200的结构可以参考图3a所示的结构。
服务器包括至少一个处理器201、至少一个存储器202和至少一个网络接口203。处理器201、存储器202和网络接口203相连,例如通过总线相连,在本申请中,所述连接可包括各类接口、传输线或总线等,本实施例对此不做限定。网络接口203用于使得服务器通过通信链路,与其它通信设备相连,例如以太网接口。
处理器201主要用于对通信数据进行处理,以及对整个服务器进行控制,执行软件程序,处理软件程序的数据,例如用于支持服务器执行实施例中所描述的动作。处理器201主要用于对整个服务器进行控制,执行软件程序,处理软件程序的数据。本领域技术人员可以理解,服务器可以包括多个处理器以增强其处理能力,服务器的各个部件可以通过各种总线连接。处理器201也可以表述为处理电路或者处理器芯片。
存储器202主要用于存储软件程序和数据。存储器202可以是独立存在,与处理器201相连。可选的,存储器202可以和处理器201集成在一起,例如集成在一个芯片之内。其中,存储器202能够存储执行本申请的技术方案的程序代码,并由处理器201来控制执行,被执行的各类计算机程序代码也可被视为是处理器201的驱动程序。
图3a仅示出了一个存储器和一个处理器。在实际的服务器中,可以存在多个处理器和多个存储器。存储器也可以称为存储介质或者存储设备等。存储器可以为与处理器处于同一芯片上的存储元件,即片内存储元件,或者为独立的存储元件,本申请对此不做限定。
还应理解,该通信系统中的电子设备又可称之为用户设备(user equipment,UE),可以部署在陆地上,包括室内或室外、手持或车载;也可以部署在水面上(如轮船等);还可以部署在空中(例如飞机、气球和卫星上等)。电子设备可以是手机(mobile phone)、平板电脑(pad)、具备无线通讯功能的可穿戴设备(如智能手表)、具有定位功能的位置追踪器、带无线收发功能的电脑、虚拟现实(virtual reality,VR)设备、增强现实(augmented reality,AR)设备、智慧家庭(smart home)中的无线设备等,本申请对此不作限定。本申请中将前述电子设备及可设置于前述电子设备的芯片统称为电子设备。
本申请中的电子设备可以包括但不限于:智能移动电话、电视、平板电脑、手环、头戴显示设备(Head Mount Display,HMD)、增强现实(augmented reality,AR)设备,混合现实(mixed reality,MR)设备、蜂窝电话(cellular phone)、智能电话(smart phone)、个人数字助理(personal digital assistant,PDA)、平板型电脑、车载电子设备、膝上型电脑(laptop computer)、个人电脑(personal computer,PC)、监控设备、机器人、车载终端、自动驾驶车辆等。当然,在以下实施例中,对该电子设备的具体形式不作任何限制。
示例性地,参阅图3b,下面以一个具体的结构为例,对本申请提供的电子设备的结构进行示例性说明。
电子设备100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。其中传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M,运动传感器180N等。
可以理解的是,本发明实施例示意的结构并不构成对电子设备100的具体限定。在本申请另一些实施例中,电子设备100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。
在一些实施例中,处理器110可以包括一个或多个接口。接口可以包括集成电路(inter-integrated circuit,I2C)接口,集成电路内置音频(inter-integrated circuit sound,I2S)接口,脉冲编码调制(pulse code modulation,PCM)接口,通用异步收发传输器(universal asynchronous receiver/transmitter,UART)接口,移动产业处理器接口(mobile industry processor interface,MIPI),通用输入输出(general-purpose input/output,GPIO)接口,用户 标识模块(subscriber identity module,SIM)接口,和/或通用串行总线(universal serial bus,USB)接口等。
I2C接口是一种双向同步串行总线,包括一根串行数据线(serial data line,SDA)和一根串行时钟线(derail clock line,SCL)。在一些实施例中,处理器110可以包含多组I2C总线。处理器110可以通过不同的I2C总线接口分别耦合触摸传感器180K,充电器,闪光灯,摄像头193等。例如:处理器110可以通过I2C接口耦合触摸传感器180K,使处理器110与触摸传感器180K通过I2C总线接口通信,实现电子设备100的触摸功能。
I2S接口可以用于音频通信。在一些实施例中,处理器110可以包含多组I2S总线。处理器110可以通过I2S总线与音频模块170耦合,实现处理器110与音频模块170之间的通信。在一些实施例中,音频模块170可以通过I2S接口向无线通信模块160传递音频信号,实现通过蓝牙耳机接听电话的功能。
PCM接口也可以用于音频通信,将模拟信号抽样,量化和编码。在一些实施例中,音频模块170与无线通信模块160可以通过PCM总线接口耦合。在一些实施例中,音频模块170也可以通过PCM接口向无线通信模块160传递音频信号,实现通过蓝牙耳机接听电话的功能。所述I2S接口和所述PCM接口都可以用于音频通信。
UART接口是一种通用串行数据总线,用于异步通信。该总线可以为双向通信总线。它将要传输的数据在串行通信与并行通信之间转换。在一些实施例中,UART接口通常被用于连接处理器110与无线通信模块160。例如:处理器110通过UART接口与无线通信模块160中的蓝牙模块通信,实现蓝牙功能。在一些实施例中,音频模块170可以通过UART接口向无线通信模块160传递音频信号,实现通过蓝牙耳机播放音乐的功能。
MIPI接口可以被用于连接处理器110与显示屏194,摄像头193等外围器件。MIPI接口包括摄像头串行接口(camera serial interface,CSI),显示屏串行接口(display serial interface,DSI)等。在一些实施例中,处理器110和摄像头193通过CSI接口通信,实现电子设备100的拍摄功能。处理器110和显示屏194通过DSI接口通信,实现电子设备100的显示功能。
GPIO接口可以通过软件配置。GPIO接口可以被配置为控制信号,也可被配置为数据信号。在一些实施例中,GPIO接口可以用于连接处理器110与摄像头193,显示屏194,无线通信模块160,音频模块170,传感器模块180等。GPIO接口还可以被配置为I2C接口,I2S接口,UART接口,MIPI接口等。
USB接口130是符合USB标准规范的接口,具体可以是Mini USB接口,Micro USB接口,USB Type C接口等。USB接口130可以用于连接充电器为电子设备100充电,也可以用于电子设备100与外围设备之间传输数据。也可以用于连接耳机,通过耳机播放音频。该接口还可以用于连接其他电子设备,例如AR设备等。
可以理解的是,本发明实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对电子设备100的结构限定。在本申请另一些实施例中,电子设备100也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。
充电管理模块140用于从充电器接收充电输入。其中,充电器可以是无线充电器,也可以是有线充电器。在一些有线充电的实施例中,充电管理模块140可以通过USB接口130 接收有线充电器的充电输入。在一些无线充电的实施例中,充电管理模块140可以通过电子设备100的无线充电线圈接收无线充电输入。充电管理模块140为电池142充电的同时,还可以通过电源管理模块141为电子设备供电。
电源管理模块141用于连接电池142,充电管理模块140与处理器110。电源管理模块141接收电池142和/或充电管理模块140的输入,为处理器110,内部存储器121,显示屏194,摄像头193,和无线通信模块160等供电。电源管理模块141还可以用于监测电池容量,电池循环次数,电池健康状态(漏电,阻抗)等参数。在其他一些实施例中,电源管理模块141也可以设置于处理器110中。在另一些实施例中,电源管理模块141和充电管理模块140也可以设置于同一个器件中。
电子设备100的无线通信功能可以通过天线1,天线2,移动通信模块150,无线通信模块160,调制解调处理器以及基带处理器等实现。
天线1和天线2用于发射和接收电磁波信号。电子设备100中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如:可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。
移动通信模块150可以提供应用在电子设备100上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块150可以包括至少一个滤波器,开关,功率放大器,低噪声放大器(low noise amplifier,LNA)等。移动通信模块150可以由天线1接收电磁波,并对接收的电磁波进行滤波,放大等处理,传送至调制解调处理器进行解调。移动通信模块150还可以对经调制解调处理器调制后的信号放大,经天线1转为电磁波辐射出去。在一些实施例中,移动通信模块150的至少部分功能模块可以被设置于处理器110中。在一些实施例中,移动通信模块150的至少部分功能模块可以与处理器110的至少部分模块被设置在同一个器件中。
调制解调处理器可以包括调制器和解调器。其中,调制器用于将待发送的低频基带信号调制成中高频信号。解调器用于将接收的电磁波信号解调为低频基带信号。随后解调器将解调得到的低频基带信号传送至基带处理器处理。低频基带信号经基带处理器处理后,被传递给应用处理器。应用处理器通过音频设备(不限于扬声器170A,受话器170B等)输出声音信号,或通过显示屏194显示图像或视频。在一些实施例中,调制解调处理器可以是独立的器件。在另一些实施例中,调制解调处理器可以独立于处理器110,与移动通信模块150或其他功能模块设置在同一个器件中。
无线通信模块160可以提供应用在电子设备100上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT),全球导航卫星系统(global navigation satellite system,GNSS),调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。无线通信模块160可以是集成至少一个通信处理模块的一个或多个器件。无线通信模块160经由天线2接收电磁波,将电磁波信号调频以及滤波处理,将处理后的信号发送到处理器110。无线通信模块160还可以从处理器110接收待发送的信号,对其进行调频,放大,经天线2转为电磁波辐射出去。
在一些实施例中,电子设备100的天线1和移动通信模块150耦合,天线2和无线通信模块160耦合,使得电子设备100可以通过无线通信技术与网络以及其他设备通信。所述无线通信技术可以包括但不限于:第五代移动通信技术(5th-Generation,5G)系统,全球移动通讯系统(global system for mobile communications,GSM),通用分组无线服务(general packet radio service,GPRS),码分多址接入(code division multiple access,CDMA),宽带码分多址(wideband code division multiple access,WCDMA),时分码分多址(time-division code division multiple access,TD-SCDMA),长期演进(long term evolution,LTE),蓝牙(bluetooth),全球导航卫星系统(the global navigation satellite system,GNSS),无线保真(wireless fidelity,WiFi),近距离无线通信(near field communication,NFC),FM(也可以称为调频广播),紫蜂协议(Zigbee),射频识别技术(radio frequency identification,RFID)和/或红外(infrared,IR)技术等。所述GNSS可以包括全球卫星定位系统(global positioning system,GPS),全球导航卫星系统(global navigation satellite system,GLONASS),北斗卫星导航系统(beidou navigation satellite system,BDS),准天顶卫星系统(quasi-zenith satellite system,QZSS)和/或星基增强系统(satellite based augmentation systems,SBAS)等。
在一些实施方式中,电子设备100也可以包括有线通信模块(图1中未示出),或者,此处的移动通信模块150或者无线通信模块160可以替换为有线通信模块(图1中未示出),该有线通信模块可以使电子设备通过有线网络与其他设备进行通信。该有线网络可以包括但不限于以下一项或者多项:光传送网(optical transport network,OTN)、同步数字体系(synchronous digital hierarchy,SDH)、无源光网络(passive optical network,PON)、以太网(Ethernet)、或灵活以太网(flex Ethernet,FlexE)等。
电子设备100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏194用于显示图像,视频等。显示屏194包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode的,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,电子设备100可以包括1个或N个显示屏194,N为大于1的正整数。
电子设备100可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。
ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将所述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头193中。
摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。 感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB摄像头,YUV等格式的图像信号。在一些实施例中,电子设备100可以包括1个或N个摄像头193,N为大于1的正整数。
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当电子设备100在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。
视频编解码器用于对数字视频压缩或解压缩。电子设备100可以支持一种或多种视频编解码器。这样,电子设备100可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现电子设备100的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
外部存储器接口120可以用于连接外部存储卡,例如Micro SD卡,实现扩展电子设备100的存储能力。外部存储卡通过外部存储器接口120与处理器110通信,实现数据存储功能。例如将音乐,视频等文件保存在外部存储卡中。
内部存储器121可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。内部存储器121可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统,至少一个功能所需的应用程序(比如声音播放功能,图像播放功能等)等。存储数据区可存储电子设备100使用过程中所创建的数据(比如音频数据,电话本等)等。此外,内部存储器121可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。处理器110通过运行存储在内部存储器121的指令,和/或存储在设置于处理器中的存储器的指令,执行电子设备100的各种功能应用以及数据处理。
电子设备100可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,以及应用处理器等实现音频功能。例如音乐播放,录音等。
音频模块170用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块170还可以用于对音频信号编码和解码。在一些实施例中,音频模块170可以设置于处理器110中,或将音频模块170的部分功能模块设置于处理器110中。
扬声器170A,也称“喇叭”,用于将音频电信号转换为声音信号。电子设备100可以通过扬声器170A收听音乐,或收听免提通话。
受话器170B,也称“听筒”,用于将音频电信号转换成声音信号。当电子设备100接听电话或语音信息时,可以通过将受话器170B靠近人耳接听语音。
麦克风170C,也称“话筒”,“传声器”,用于将声音信号转换为电信号。当拨打电话或 发送语音信息时,用户可以通过人嘴靠近麦克风170C发声,将声音信号输入到麦克风170C。电子设备100可以设置至少一个麦克风170C。在另一些实施例中,电子设备100可以设置两个麦克风170C,除了采集声音信号,还可以实现降噪功能。在另一些实施例中,电子设备100还可以设置三个,四个或更多麦克风170C,实现采集声音信号,降噪,还可以识别声音来源,实现定向录音功能等。
耳机接口170D用于连接有线耳机。耳机接口170D可以是USB接口130,也可以是3.5mm的开放移动电子设备平台(open mobile terminal platform,OMTP)标准接口,美国蜂窝电信工业协会(cellular telecommunications industry association of the USA,CTIA)标准接口。
压力传感器180A用于感受压力信号,可以将压力信号转换成电信号。在一些实施例中,压力传感器180A可以设置于显示屏194。压力传感器180A的种类很多,如电阻式压力传感器,电感式压力传感器,电容式压力传感器等。电容式压力传感器可以是包括至少两个具有导电材料的平行板。当有力作用于压力传感器180A,电极之间的电容改变。电子设备100根据电容的变化确定压力的强度。当有触摸操作作用于显示屏194,电子设备100根据压力传感器180A检测所述触摸操作强度。电子设备100也可以根据压力传感器180A的检测信号计算触摸的位置。在一些实施例中,作用于相同触摸位置,但不同触摸操作强度的触摸操作,可以对应不同的操作指令。例如:当有触摸操作强度小于第一压力阈值的触摸操作作用于短消息应用图标时,执行查看短消息的指令。当有触摸操作强度大于或等于第一压力阈值的触摸操作作用于短消息应用图标时,执行新建短消息的指令。陀螺仪传感器180B可以用于确定电子设备100的运动姿态。在一些实施例中,可以通过陀螺仪传感器180B确定电子设备100围绕三个轴(即,x,y和z轴)的角速度。陀螺仪传感器180B可以用于拍摄防抖。示例性的,当按下快门,陀螺仪传感器180B检测电子设备100抖动的角度,根据角度计算出镜头模组需要补偿的距离,让镜头通过反向运动抵消电子设备100的抖动,实现防抖。陀螺仪传感器180B还可以用于导航,体感游戏场景。气压传感器180C用于测量气压。在一些实施例中,电子设备100通过气压传感器180C测得的气压值计算海拔高度,辅助定位和导航。磁传感器180D包括霍尔传感器。电子设备100可以利用磁传感器180D检测翻盖皮套的开合。在一些实施例中,当电子设备100是翻盖机时,电子设备100可以根据磁传感器180D检测翻盖的开合。进而根据检测到的皮套的开合状态或翻盖的开合状态,设置翻盖自动解锁等特性。加速度传感器180E可检测电子设备100在各个方向上(一般为三轴)加速度的大小。当电子设备100静止时可检测出重力的大小及方向。还可以用于识别电子设备姿态,应用于横竖屏切换,计步器等应用。距离传感器180F,用于测量距离。电子设备100可以通过红外或激光测量距离。在一些实施例中,拍摄场景,电子设备100可以利用距离传感器180F测距以实现快速对焦。接近光传感器180G可以包括例如发光二极管(LED)和光检测器,例如光电二极管。发光二极管可以是红外发光二极管。电子设备100通过发光二极管向外发射红外光。电子设备100使用光电二极管检测来自附近物体的红外反射光。当检测到充分的反射光时,可以确定电子设备100附近有物体。当检测到不充分的反射光时,电子设备100可以确定电子设备100附近没有物体。电子设备100可以利用接近光传感器180G检测用户手持电子设备100贴近耳朵通话,以便自动熄灭屏幕达到省电 的目的。接近光传感器180G也可用于皮套模式,口袋模式自动解锁与锁屏。环境光传感器180L用于感知环境光亮度。电子设备100可以根据感知的环境光亮度自适应调节显示屏194亮度。环境光传感器180L也可用于拍照时自动调节白平衡。环境光传感器180L还可以与接近光传感器180G配合,检测电子设备100是否在口袋里,以防误触。指纹传感器180H用于采集指纹。电子设备100可以利用采集的指纹特性实现指纹解锁,访问应用锁,指纹拍照,指纹接听来电等。温度传感器180J用于检测温度。触摸传感器180K,也称“触控器件”。触摸传感器180K可以设置于显示屏194,由触摸传感器180K与显示屏194组成触摸屏,也称“触控屏”。触摸传感器180K用于检测作用于其上或附近的触摸操作。触摸传感器可以将检测到的触摸操作传递给应用处理器,以确定触摸事件类型。可以通过显示屏194提供与触摸操作相关的视觉输出。在另一些实施例中,触摸传感器180K也可以设置于电子设备100的表面,与显示屏194所处的位置不同。骨传导传感器180M可以获取振动信号。
运动传感器180N,可以用于对摄像头拍摄的范围内的运动物体进行检测,采集运动物体的运动轮廓或者运动轨迹等。例如,该运动传感器180N可以是红外传感器、激光传感器、动态视觉传感器(dynamic vision sensor,DVS)等,该DVS具体可以包括DAVIS(Dynamic and Active-pixel Vision Sensor)、ATIS(Asynchronous Time-based Image Sensor)或者CeleX传感器等传感器。DVS借鉴了生物视觉的特性,每个像素模拟一个神经元,独立地对光照强度(以下简称“光强”)的相对变化做出响应。当光强的相对变化超过阈值时,像素会输出一个事件信号,包括像素的位置、时间戳以及光强的特征信息。
按键190包括开机键,音量键等。按键190可以是机械按键。也可以是触摸式按键。电子设备100可以接收按键输入,产生与电子设备100的用户设置以及功能控制有关的键信号输入。
马达191可以产生振动提示。
指示器192可以是指示灯,可以用于指示充电状态,电量变化,也可以用于指示消息,未接来电,通知等。
SIM卡接口195用于连接SIM卡。SIM卡可以通过插入SIM卡接口195,或从SIM卡接口195拔出,实现和电子设备100的接触和分离。电子设备100可以支持1个或N个SIM卡接口,N为大于1的正整数。SIM卡接口195可以支持Nano SIM卡,Micro SIM卡,SIM卡等。同一个SIM卡接口195可以同时插入多张卡。所述多张卡的类型可以相同,也可以不同。SIM卡接口195也可以兼容不同类型的SIM卡。SIM卡接口195也可以兼容外部存储卡。电子设备100通过SIM卡和网络交互,实现通话以及数据通信等功能。在一些实施例中,电子设备100采用eSIM,即:嵌入式SIM卡。eSIM卡可以嵌在电子设备100中,不能和电子设备100分离。
电子设备300从逻辑上可划分为硬件层、操作系统311,以及应用程序层。硬件层包括如上所述的应用处理器301、MCU 302、存储器303、modem 304、Wi-Fi模块306、传感器308、定位模块310等硬件资源。本申请对电子设备300搭载的操作系统类型不作任何限制。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的，神经单元可以是指以$x_{s}$和截距1为输入的运算单元，该运算单元的输出可以为：$h_{W,b}(x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$
其中,s=1、2、……n,n为大于1的自然数,Ws为xs的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
神经网络中的每一层的工作可以用数学表达式 $y=a(W\cdot\vec{x}+b)$ 来描述：从物理层面神经网络中的每一层的工作可以理解为通过五种对输入空间（输入向量的集合）的操作，完成输入空间到输出空间的变换（即矩阵的行空间到列空间），这五种操作包括：1、升维/降维；2、放大/缩小；3、旋转；4、平移；5、“弯曲”。其中1、2、3的操作由 $W\cdot\vec{x}$ 完成，4的操作由 $+b$ 完成，5的操作则由 $a()$ 来实现。这里之所以用“空间”二字来表述是因为被分类的对象并不是单个事物，而是一类事物，空间是指这类事物所有个体的集合。其中，W是权重向量，该向量中的每一个值表示该层神经网络中的一个神经元的权重值。该向量W决定着上文所述的输入空间到输出空间的空间变换，即每一层的权重W控制着如何变换空间。训练神经网络的目的，也就是最终得到训练好的神经网络的所有层的权重矩阵（由很多层的向量W形成的权重矩阵）。因此，神经网络的训练过程本质上就是学习控制空间变换的方式，更具体的就是学习权重矩阵。
因为希望神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到神经网络能够预测出真正想要的目标值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么神经网络的训练就变成了尽可能缩小这个loss的过程。
(2)深度神经网络
深度神经网络(Deep Neural Network,DNN),也称多层神经网络,可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准。从DNN按不同层的位 置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:
$\vec{y}=\alpha(W\vec{x}+\vec{b})$，其中，$\vec{x}$是输入向量，$\vec{y}$是输出向量，$\vec{b}$是偏移向量，W是权重矩阵（也称系数），α()是激活函数。每一层仅仅是对输入向量$\vec{x}$经过如此简单的操作得到输出向量$\vec{y}$。由于DNN层数多，则系数W和偏移向量$\vec{b}$的数量也就很多了。这些参数在DNN中的定义如下所述：以系数W为例：假设在一个三层的DNN中，第二层的第4个神经元到第三层的第2个神经元的线性系数定义为$W_{24}^{3}$。上标3代表系数W所在的层数，而下标对应的是输出的第三层索引2和输入的第二层索引4。总结就是：第L-1层的第k个神经元到第L层的第j个神经元的系数定义为$W_{jk}^{L}$。需要注意的是，输入层是没有W参数的。在深度神经网络中，更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言，参数越多的模型复杂度越高，“容量”也就越大，也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程，其最终目的是得到训练好的深度神经网络的所有层的权重矩阵（由很多层的向量W形成的权重矩阵）。
(3)卷积神经网络
卷积神经网络(CNN,Convolutional Neuron Network)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(4)循环神经网络(RNN,Recurrent Neural Networks)是用来处理序列数据的。在传统的神经网络模型中,是从输入层到隐含层再到输出层,层与层之间是全连接的,而对于每一层层内之间的各个节点是无连接的。这种普通的神经网络虽然解决了很多难题,但是却仍然对很多问题却无能无力。例如,你要预测句子的下一个单词是什么,一般需要用到前面的单词,因为一个句子中前后单词并不是独立的。RNN之所以称为循环神经网路,即一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用于当前输出的计算中,即隐含层本层之间的节点不再无连接而是有连接的,并且隐含层的输入不仅包括输入层的输出还包括上一时刻隐含层的输出。理论上,RNN能够对任何长度的序列数据进行处理。对于RNN的训练和对传统的CNN或DNN的训练一样。同样使用误差反向传播算法,不过有一点区别:即,如果将RNN进行网络展开,那么其中的参数,如W,是共享的;而如上举例上述的传统神经网络却不是这样。并且在使用梯度下降算法中,每一步的输出不仅依赖当前步的网络,还依赖前面若干步网络的状态。该学习算法称为基于时间的反向传播算法Back propagation Through Time(BPTT)。
既然已经有了卷积神经网络,为什么还要循环神经网络?原因很简单,在卷积神经网络中,有一个前提假设是:元素之间是相互独立的,输入与输出也是独立的,比如猫和狗。但现实世界中,很多元素都是相互连接的,比如股票随时间的变化,再比如一个人说了:我喜欢旅游,其中最喜欢的地方是云南,以后有机会一定要去。这里填空,人类应该都知道是填“云南”。因为人类会根据上下文的内容进行推断,但如何让机器做到这一步?RNN就应运而生了。RNN旨在让机器像人一样拥有记忆的能力。因此,RNN的输出就需要依赖当前的输入信息和历史的记忆信息。
(5)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(6)反向传播算法
卷积神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的超分辨率模型中参数的大小,使得超分辨率模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的超 分辨率模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的超分辨率模型的参数,例如权重矩阵。
(7)基频(fundamental frequency)
当发声体由于振动而发出声音时,声音一般可以分解为许多单纯的正弦波,也就是说所有的自然声音基本都是由许多频率不同的正弦波组成的,其中频率最低的正弦波即为基音(即基频,用F0表示),而其他频率较高的正弦波则为泛音。
(8)能量(energy)
能量又称强度或音量,可以代表声音的大小,可由声音讯号的振幅来模拟,振幅越大,代表此声音波形的音量越大。
(9)韵律(prosody)
在语音合成领域里,韵律泛指控制语调、音调、重音强调、停顿和节奏等的功能的特征。韵律可以反映出说话者的情感状态,讲话形式等。
(10)声码器(vocoder)
声码器是一种声音信号处理模块或软件,其能将声学特征编码成声音波形。
下面分别从神经网络的训练侧和神经网络的应用侧对本申请提供的方法进行描述。
本申请实施例提供的神经网络的训练方法,涉及自然语言数据的处理,具体可以应用于数据训练、机器学习、深度学习等数据处理方法,对训练数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等,最终得到训练好的文本处理模型(包括特征提取模型以及循环神经网络RNN);并且,本申请实施例提供的文本数据处理方法可以运用上述训练好的文本处理模型,将输入数据(如本申请中的目标文本)输入到所述训练好的文本处理模型中,得到输出数据(如本申请中的音频)。需要说明的是,本申请实施例提供的文本处理模型的训练方法和文本处理方法是基于同一个构思产生的发明,也可以理解为一个系统中的两个部分,或一个整体流程的两个阶段:如模型训练阶段和模型应用阶段。
参见图4,图4为本申请实施例提供的一种文本数据处理方法,如图4中示出的那样,本实施例提供的文本数据处理方法包括:
401、获取目标文本,所述目标文本的音素包括相邻的第一音素和第二音素。
步骤401的执行主体可以为电子设备,具体的,用户可以在电子设备上输入需要进行音频转换的目标文本,相应的,电子设备可以获取到目标文本。
步骤401的执行主体可以为服务器,具体的,用户可以在电子设备上输入需要进行音频转换的目标文本,电子设备可以将目标文本发送至服务器,相应的,服务器可以获取到目标文本。
本申请实施例中,电子设备可以显示文本输入框以及文本输入指示,所述文本输入指示用于指示用户在所述文本输入框中输入需要进行音频转换的文本,电子设备可以获取用户在所述文本输入框输入的目标文本。
在一种场景中,电子设备上可以安装有可以根据目标文本生成该目标文本对应的音频的应用程序,用户可以打开相关的应用程序,在应用程序中输入需要进行音频转换的目标文本,进而电子设备可以根据目标文本生成该目标文本对应的音频,或者将目标文本发送至服务器,由服务器根据目标文本生成该目标文本对应的音频。
本申请实施例中,在获取到目标文本之后,可以根据文本转语音(text to speech,TTS)模型,对所述目标文本进行处理,以得到目标文本对应的音频。
本申请实施例中,在获取到目标文本之后,可以对目标文本进行预处理,将目标文本处理为适配于TTS模型输入格式的序列,示例性的,服务器可以对目标文本进行文本归一化,将不规范的目标文本转为可发音格式,并进行分词处理,按词语为单位分割目标文本中的句子,来解决句子歧义性,并进行韵律分析,预测目标文本中各个句子的停顿节奏和/或重音等,并将目标文本的字转换为音素级别,以得到音素串(也就是目标文本的音素),并将音素串转换为TTS模型需要的序列格式(后续实施例可以称之为ID序列)。
应理解,音素(phone)也可以称之为发声音素,是根据语音的自然属性划分出来的最小语音单位,依据音节里的发音动作来分析,一个动作构成一个音素。音素分为元音与辅音两大类。例如,汉语音节a(例如,一声:啊)只有一个音素,ai(例如四声:爱)有两个音素,dai(例如一声:呆)有三个音素等。
示例性的,目标英文文本为“governments have made policy decisions”,该目标文本的音素为“G AH1V ER0 M AH0 N T HH AE1 V M EY1 D P AA1 L AH0 S IY0 D IH0 S IH1 ZH AH0 N Z”。再例如,目标中文文本“今天天气怎么样”的音素为“j”“in”“t”“i”“an”……。
本申请实施例中,目标文本的音素可以包括相邻的第一音素和第二音素。目标文本的音素是由多个音素按照特定的顺序进行排列的音素序列,第一音素和第二音素可以是上述音素序列中任意相邻的两个音素。
具体的,可以获取目标文本的M个音素,并通过神经网络对所述M个音素进行处理,以获取M个特征向量,参照图5,可以将目标文本转换为序列化的身份标识ID(IDentity)序列,ID序列中的每个标识可以对应于M个音素中的一个音素,相应的,ID序列中包括两个相邻的标识,该相邻的两个标识分别对应于第一音素和第二音素。
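示例性的,下面给出一段将音素串转换为ID序列的Python示意代码(其中的音素表与编号方式均为假设,实际使用的音素集由具体TTS系统决定,此处仅用于说明序列化的格式):

```python
# 假设的音素到ID的映射表(实际的音素集与编号方式由具体TTS系统决定)
phoneme_to_id = {"j": 1, "in": 2, "t": 3, "i": 4, "an": 5}

def phonemes_to_ids(phonemes):
    # 将音素串转换为模型需要的ID序列,未知音素映射为0
    return [phoneme_to_id.get(p, 0) for p in phonemes]

# "今天"对应的音素串(示例),其中任意相邻的两个音素即对应文中的第一音素与第二音素
phonemes = ["j", "in", "t", "i", "an"]
print(phonemes_to_ids(phonemes))   # 例如输出 [1, 2, 3, 4, 5]
```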
402、对所述第一音素和所述第二音素进行特征提取,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征。
步骤402的执行主体可以是电子设备也可以是服务器。
具体的,电子设备可以获取目标文本,并将目标文本发送至服务器,服务器可以对所述第一音素和所述第二音素进行特征提取,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征;或者,电子设备可以获取目标文本,并对所述第一音素和所述第二音素进行特征提取,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征;或者,服务器可以获取目标文本,并对所述第一音素和所述第二音素进行特征提取,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征。
应理解,可以对目标文本的音素进行特征提取,以得到多个音频特征,其中,多个音频特征包括所述第一音素的第一音频特征、以及所述第二音素的第二音频特征。
接下来描述,如何对所述第一音素和所述第二音素进行特征提取,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征。
本申请实施例中,可以利用编码器(例如图5中示出的卷积神经网络(convolutional neural networks,CNN)、循环神经网络(recurrent neural network,RNN)、transformer等网络结构或混合网络结构)对目标文本的音素进行特征提取。例如,编码器的串行结构可以包括但不限于LUP层(其维度为512)、3层512个卷积核kernel大小为5的filter及1层隐含层为512的双向循环神经网络层。可以利用编码器将目标文本的音素转化为隐含层表示序列(也可以称之为特征向量),即将目标文本的音素映射成中间隐式表示H,对于每个音素将生成一个特征向量,这些特征向量包含丰富的音素上下文环境信息。
具体的,编码器可以将目标文本的音素对应的ID序列编码为M个特征向量(或者称之为嵌入embedding向量),其中,特征向量可以包括目标文本的音素的抽象内容信息。
为了能够得到可以包括更丰富的特征的音频特征,在利用编码器将目标文本的音素转化为特征向量之后,可以对编码器得到的特征向量进行韵律预测,其中,韵律预测可以包括三个部分:时长预测、音高预测和能量预测,分别用来表征音素的时长信息、基频信息和能量信息。示例性的,三种预测(时长预测、音高预测和能量预测)可以均采用相同的结构来实现,例如结构可以包括2个卷积层(384个3*1的卷积核)以及一层隐含层数目为384的全连接层,另外音高预测和能量预测还可以额外增加一层卷积层(512个9*1卷积核),它们均以H作为输入,分别预测各音素对应的时长信息D、基频信息F和能量信息E。
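示例性的,下面给出一段与上述编码器及韵律预测模块结构对应的PyTorch示意代码(层数与维度参照上文描述,但padding、激活函数、循环网络类型等实现细节为假设,仅为草图而非本申请限定的实现):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    # 参照上文:LUP(查找表)层、3层卷积(512个kernel大小为5的filter)、1层双向循环网络
    def __init__(self, n_phonemes=100, dim=512):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)
        self.convs = nn.ModuleList([nn.Conv1d(dim, dim, kernel_size=5, padding=2) for _ in range(3)])
        self.rnn = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, ids):                      # ids: [B, M] 音素ID序列
        h = self.embed(ids).transpose(1, 2)      # [B, dim, M]
        for conv in self.convs:
            h = torch.relu(conv(h))
        h, _ = self.rnn(h.transpose(1, 2))       # [B, M, dim],即隐含层表示H
        return h

class VariancePredictor(nn.Module):
    # 时长/音高/能量预测模块的共用结构:2层卷积 + 1层全连接,输出每个音素的标量预测
    def __init__(self, dim=512, hidden=384):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, h):                        # h: [B, M, dim]
        x = torch.relu(self.conv1(h.transpose(1, 2)))
        x = torch.relu(self.conv2(x))
        return self.fc(x.transpose(1, 2)).squeeze(-1)   # [B, M],各音素的预测值

encoder, dur_pred = Encoder(), VariancePredictor()
ids = torch.randint(0, 100, (1, 6))              # 假设的ID序列
H = encoder(ids)
print(H.shape, dur_pred(H).shape)
```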
其中,时长信息可以指各个音素所对应的帧数,所谓各个音素所对应的帧数,是指各个音素在多少个帧内演示,例如第一音素用于在N帧内演示,所述第二音素用于在M帧内演示。在得到时长信息之后,可以对各音素的特征向量H、基频信息F和能量信息E进行上采样处理,以得到各音素的各个帧的特征向量H、基频信息F和能量信息E。
之后,可以根据各音素的各个帧的特征向量H、基频信息F和能量信息E,确定各音素的各个帧的音频特征。例如可以将各音素的各个帧的特征向量H、基频信息F和能量信息E的向量之和作为各个音素的各个帧的音频特征,进而可以获取到所述第一音素的第一音频特征、以及所述第二音素的第二音频特征。
应理解,上述音频特征(包括第一音频特征和第二音频特征)可以为根据各个帧的特征向量H、基频信息F和能量信息E中的至少一种得到的,且得到的方式也不局限于加和,例如可以是结合权重的加和,或者是通过其他数学运算、神经网络得到,本申请并不限定。
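示例性的,下面用一段Python示意代码说明按时长信息对H、F、E进行上采样(帧级展开)并求和得到各帧音频特征的一种可能方式(如上文所述,求和只是其中一种实现,加权或其他运算同样可行,代码中的数值均为假设):

```python
import numpy as np

def length_regulate(per_phoneme, durations):
    # 按各音素对应的帧数,将音素级的向量重复展开为帧级的向量
    return np.concatenate([np.repeat(v[None, ...], d, axis=0)
                           for v, d in zip(per_phoneme, durations)], axis=0)

dim = 4
H = np.arange(2 * dim, dtype=float).reshape(2, dim)   # 两个音素的特征向量(假设值)
F = np.array([[0.1] * dim, [0.2] * dim])              # 基频信息(已映射到同一维度,假设)
E = np.array([[0.3] * dim, [0.4] * dim])              # 能量信息(同上)
D = [3, 2]                                            # 第一音素3帧、第二音素2帧

G = length_regulate(H, D) + length_regulate(F, D) + length_regulate(E, D)
print(G.shape)   # (5, 4):共5帧,每帧对应一个音频特征
```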
本申请实施例中,可以获取到第一音素的第一音频特征以及第二音素的第二音频特征,第一音频特征可以包括第一音素对应的各个帧的音频特征,第二音频特征可以包括第二音素对应的各个帧的音频特征,具体的,所述第一音素用于在N帧内演示,所述第一音频特征的数量为N,且N个第一音频特征中的每个音频特征对应于所述N帧中的一帧;所述第二音素用于在M帧内演示,所述第二音频特征的数量为M,且M个第二音频特征中的每个音频特征对应于所述M帧中的一帧。
403、通过目标循环神经网络RNN根据所述第一音频特征获取所述第一音素对应的第一语音数据,通过所述目标RNN根据所述第二音频特征获取所述第二音素对应的第二语音数据;其中,所述获取所述第一音素对应的第一语音数据和所述获取所述第二音素对应的第二语音数据的步骤并行执行。
本申请实施例中,在得到所述第一音素的第一音频特征、以及所述第二音素的第二音频特征之后,可以根据所述第一音频特征和所述第二音频特征,通过目标循环神经网络RNN并行确定所述第一音素对应的第一语音数据以及所述第二音素对应的第二语音数据。
其中,步骤403的执行主体可以为电子设备或者服务器。具体的,电子设备可以获取目标文本,并对所述第一音素和所述第二音素进行特征提取,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征,根据所述第一音频特征和所述第二音频特征,通过目标循环神经网络RNN获取所述第一音素对应的第一语音数据以及所述第二音素对应的第二语音数据;或者,电子设备可以获取目标文本,并对所述第一音素和所述第二音素进行特征提取,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征,将所述第一音素的第一音频特征、以及所述第二音素的第二音频特征发送至服务器,服务器可以根据所述第一音频特征和所述第二音频特征,通过目标循环神经网络RNN获取所述第一音素对应的第一语音数据以及所述第二音素对应的第二语音数据;或者,电子设备可以获取目标文本,并将目标文本发送至服务器,服务器可以对所述第一音素和所述第二音素进行特征提取,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征,并根据所述第一音频特征和所述第二音频特征,通过目标循环神经网络RNN获取所述第一音素对应的第一语音数据以及所述第二音素对应的第二语音数据。
在一种可能的实现中,所述第一语音数据以及所述第二语音数据为梅尔频谱MEL或巴克谱Bark。
本申请实施例中,可以通过预训练好的RNN对所述第一音素的第一音频特征、以及所述第二音素的第二音频特征进行处理,以得到所述第一音素对应的第一语音数据以及所述第二音素对应的第二语音数据。具体的,可以通过预训练好的RNN对目标文本的音素进行处理,以得到目标文本的各个音素的语音数据,其中,各个音素的语音数据包括所述第一音素对应的第一语音数据以及所述第二音素对应的第二语音数据。
应理解,在对所述第一音素的第一音频特征、以及所述第二音素的第二音频特征进行处理的过程还可以包括除了RNN之外的其他网络结构,本申请并不限定。例如,可以利用Taco2的网络结构处理音素的音频特征,如依次包括LSTM、线性映射层(linear projection)和前置层(Pre-net)。
现有的实现中,在通过RNN处理音频特征以获取语音数据的过程中,针对于音素的各个帧以及音素之间的相邻帧,RNN的隐含层的输入不仅包括输入层处理当前帧的音频特征的输出,还包括隐含层处理上一帧的音频特征的输出,也就是说,对于每个音素单位,其内部采用自回归方式执行,不同音素间也同样采用自回归方式执行。示例性的,第一目标音频特征x_{t-1}和第二目标音频特征x_t为不同音素的相邻帧的音频特征,RNN在处理第二目标音频特征x_t时,RNN的输入层处理第二目标音频特征x_t得到的结果可以作为RNN的隐含层的输入,同时RNN的隐含层在处理输入层处理第一目标音频特征x_{t-1}得到的结果后得到的隐含层输出s_{t-1}也作为RNN的隐含层的输入。相当于,对于每个音素单位,其内部采用自回归方式执行,不同音素间也同样采用自回归方式执行。不同音素间采用自回归方式会大大增加RNN处理音频特征时所需的算力开销以及处理时间。
本申请实施例中,为了降低RNN处理音频特征时所需的算力开销以及处理时间,目标RNN可以并行处理第一音频特征以及第二音频特征;其中,所谓并行,是指目标RNN可以在根据第一音频特征计算第一语音数据的过程中,也在进行着根据第二音频特征计算第二语音数据的过程;
更具体的,目标RNN包括隐含层以及输出层,且第一音频特征和第二音频特征可以为多帧的音频特征,以第一目标音频特征为第一音频特征中最后一帧的音频特征、第二目标音频特征为第二音频特征中第一帧的音频特征为例,目标RNN计算语音数据的过程可以包括:
隐含层开始处理第一目标音频特征,隐含层计算得到第一子隐含层输出,输出层开始处理第一子隐含层输出,输出层计算得到语音数据;
隐含层开始处理第二目标音频特征,隐含层计算得到第二子隐含层输出,输出层开始处理第二子隐含层输出,输出层计算得到语音数据;
本实施例中,所谓并行,是指目标RNN的隐含层开始处理第二目标音频特征的时间在隐含层计算得到第一子隐含层输出之前,换一种表述方式,目标RNN的隐含层什么时候开始处理第二目标音频特征,并不依赖于隐含层完成第一子隐含层输出的计算,而依赖于第二目标音频特征的获取时间,在获取到第二目标音频特征之后,目标RNN的隐含层就可以直接开始处理第二目标音频特征;
为了实现目标RNN可以并行处理第一音频特征以及第二音频特征,可以在不同音素之间,采用非自回归的执行方式。
具体的,所述第一音素的时长为N帧,所述第一音频特征的数量为N,且N个第一音频特征中的每个音频特征对应于所述N帧中的一帧,所述N个第一音频特征包括第一目标音频特征,所述第一目标音频特征为所述N个第一音频特征中倒数第一帧的音频特征;所述第二音素用于在M帧内演示,所述第二音频特征的数量为M,且M个第二音频特征中的每个音频特征对应于所述M帧中的一帧,所述M个第二音频特征包括第二目标音频特征,所述第二目标音频特征为所述M个第二音频特征中正数第一帧的音频特征。也就是说,第一目标音频特征和第二目标音频特征为不同音素的相邻帧的音频特征。
所述N个第一音频特征还包括第三目标音频特征,所述第三目标音频特征为所述N个第一音频特征中倒数第二帧的音频特征,也就是说,第一目标音频特征和第三目标音频特征为相同音素的相邻帧的音频特征。
在RNN处理第三目标音频特征的过程中,所述隐含层可以根据所述第三目标音频特征确定第三子隐含层输出,具体的,所述隐含层可以根据RNN的输入层处理所述第三目标音频特征得到的输入层输出来确定第三子隐含层输出,所述输出层可以根据所述第三子隐含层输出确定所述第三子语音数据。其中,第三子语音数据可以为梅尔频谱MEL或巴克谱Bark。
在RNN处理第一音频特征的过程中,所述隐含层可以根据所述第一目标音频特征和所述第三子隐含层输出确定所述第一子隐含层输出,所述输出层可以根据所述第一子隐含层输出确定所述第一子语音数据。相当于,针对于同一个音素的各个帧,RNN的隐含层的输入不仅包括输入层处理当前帧的音频特征的输出,还包括隐含层处理上一帧的音频特征的输出,也就是说,对于每个音素单位,其内部采用自回归方式执行。
在RNN处理第二音频特征的过程中,所述隐含层还可以根据所述第二目标音频特征确定第二子隐含层输出,所述输出层可以根据所述第二子隐含层输出确定所述第二子语音数据,和现有的实现中,所述隐含层可以根据所述第二目标音频特征以及第一子隐含层输出来确定第二子隐含层输出不同的是,本实施例中,在所述隐含层确定所述第二子隐含层输出的过程中,所述第一子隐含层输出不作为所述隐含层的输入。
示例性的,可以参照图6,第一目标音频特征x_{t-1}和第二目标音频特征x_t为不同音素的相邻帧的音频特征,RNN在处理第二目标音频特征x_t时,RNN的输入层U处理第二目标音频特征x_t得到的结果可以作为RNN的隐含层的输入,同时RNN的隐含层在处理输入层U处理第一目标音频特征x_{t-1}得到的结果后得到的隐含层输出s_{t-1}不作为RNN的隐含层的输入。相当于,RNN在处理不同音素之间的两个相邻帧中靠后的一帧的音频特征时,隐含层的输入不包括隐含层处理上一帧的音频特征的输出,也就是说,对于不同的音素单位,不同音素间不采用自回归方式执行,从而降低了RNN处理音频特征时所需的算力开销以及处理时间。
应理解,本申请实施例并不限定RNN在处理第二目标音频特征时,RNN的隐含层的输入仅包括RNN的输入层处理第二目标音频特征得到的结果。
本申请实施例中,在得到各个音素的各个帧的语音数据之后,可以将各语音数据按帧数的前后顺序拼接得到语音数据处理结果,语音数据处理结果还可以经过后置网络(Post-net)进行补偿。
在一种可能的实现中,所述隐含层可以在确定出所述第一子隐含层输出之前,根据所述第二音频特征确定所述第二子隐含层输出。
在现有的实现中,在通过RNN处理音频特征以获取语音数据的过程中,针对于不同音素之间的相邻帧,RNN的隐含层的输入不仅包括输入层处理当前帧的音频特征的输出,还包括隐含层处理上一帧的音频特征的输出。因此,RNN在处理不同音素之间的两个相邻帧中靠后的一帧的音频特征时,需要等待隐含层处理上一帧的音频特征并得到隐含层输出之后,才可以进行当前帧的音频特征处理。
本申请实施例中,由于RNN在处理不同音素之间的两个相邻帧中靠后的一帧的音频特征时,隐含层的输入不包括隐含层处理上一帧的音频特征的输出,进而,RNN在处理不同音素之间的两个相邻帧中靠后的一帧的音频特征时,不需要等待隐含层处理上一帧的音频特征并得到隐含层输出之后,就可以进行当前帧的音频特征处理。也就是说,隐含层可以用于在确定出所述第一子隐含层输出之前,就根据所述第二音频特征确定所述第二子隐含层输出,从而进一步降低了RNN处理音频特征过程的时间开销。
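示例性的,下面用一段Python示意代码说明上述“音素内部自回归、不同音素之间非自回归”的计算方式:每个音素内部各帧依次计算,不同音素之间互不依赖,因而可以并行执行(代码中的隐含层/输出层计算方式与数值均为假设,仅为概念性草图,并非目标RNN的具体实现):

```python
import numpy as np

def hidden_step(x_t, s_prev, U, W):
    # 隐含层:当前帧输入与同一音素内上一帧的隐含层输出共同决定当前隐含层输出
    return np.tanh(U @ x_t + W @ s_prev)

def decode_one_phoneme(frames, U, W, V):
    # 音素内部:自回归。每个音素的第一帧不使用前一音素最后一帧的隐含层输出,
    # 而是从全零状态开始,因此不同音素之间的计算互不依赖,可以并行执行
    s = np.zeros(W.shape[0])
    outputs = []
    for x_t in frames:
        s = hidden_step(x_t, s, U, W)
        outputs.append(V @ s)          # 输出层根据隐含层输出得到该帧的语音数据
    return outputs

rng = np.random.default_rng(0)
U, W, V = rng.standard_normal((8, 4)), rng.standard_normal((8, 8)), rng.standard_normal((80, 8))
phoneme1 = [rng.standard_normal(4) for _ in range(3)]   # 第一音素:N=3帧的音频特征(假设)
phoneme2 = [rng.standard_normal(4) for _ in range(2)]   # 第二音素:M=2帧的音频特征(假设)

# 两个音素的解码互不依赖,实际部署时可分别提交到不同线程/批次中并行执行
mel = decode_one_phoneme(phoneme1, U, W, V) + decode_one_phoneme(phoneme2, U, W, V)
print(len(mel))   # 5帧语音数据
```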
404、根据所述第一语音数据和所述第二语音数据,通过声码器获取所述第一音素和所述第二音素对应的音频。
步骤404的执行主体可以是电子设备也可以是服务器。
本申请实施例中,在得到第一语音数据以及第二语音数据之后,可以以第一语音数据以及第二语音数据作为声码器的输入,输出音频。具体的,在得到目标文本对应的语音数据之后,可以将语音数据作为声码器的输入,进而输出目标文本的音频,该音频包括所述第一音素和所述第二音素对应的音频。
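示例性的,声码器可以将语音数据(如MEL谱)转换为声音波形。下面给出一段以Griffin-Lim算法作为声码器替代品的Python示意代码(本申请并不限定声码器的具体实现,此处使用librosa库及其参数均为假设性的示例):

```python
import numpy as np
import librosa
import soundfile as sf

# 假设mel为模型输出的梅尔频谱,形状为(n_mels, 帧数),数值为假设的功率谱
mel = np.abs(np.random.default_rng(0).standard_normal((80, 200))) ** 2

# 通过Griffin-Lim迭代从梅尔频谱近似重建波形(仅作为神经声码器的简单替代,效果有限)
wav = librosa.feature.inverse.mel_to_audio(mel, sr=22050, n_fft=1024, hop_length=256)

sf.write("output.wav", wav, 22050)   # 将第一音素和第二音素对应的音频写入文件
```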
本申请实施例提供了一种文本数据处理方法,包括:获取目标文本,所述目标文本的音素包括相邻的第一音素和第二音素;对所述第一音素和所述第二音素进行特征提取,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征;通过目标循环神经网络RNN根据所述第一音频特征获取所述第一音素对应的第一语音数据,通过所述目标RNN根据所述第二音频特征获取所述第二音素对应的第二语音数据;其中,所述获取所述第一音素对应的第一语音数据和所述获取所述第二音素对应的第二语音数据的步骤并行执行;根据所述第一语音数据和所述第二语音数据,通过声码器获取所述第一音素和所述第二音素对应的音频。通过上述方式,目标RNN可以并行处理第一音频特征和第二音频特征,即实现了第一音频特征和第二音频特征的处理过程的解耦,减少了目标RNN处理音频特征的时长。
接下来描述如何通过模型训练来得到上述实施例中的TTS模型(包括RNN以及用于进行特征提取的网络)。
语音合成技术(也就是基于目标文本得到对应的音频的技术)根据服务的提供方式可划分为基于云端引擎的语音合成(可简称“在线语音合成”)和基于本地引擎的语音合成(简称为“离线语音合成”)两种,在线语音合成具有高自然度、高实时性和不占用客户端设备资源等特点,但是其缺点也很明显,由于使用语音合成的应用(application,APP)可以一次性发送大段文本到服务器,但是服务器合成的语音数据是分段发回到安装上述APP的客户端的,而语音的数据量即使经过压缩也相对较大,如果网络环境不稳定,在线合成将变得非常缓慢而无法实现连贯的合成;离线合成则可脱离对网络的依赖,能够保证合成服务的稳定性以及用户隐私。离线合成对模型本身提出更高的要求,需要模型运行速度较快,能够在终端设备(如手机、音响、大屏等IoT设备)上实时运行,同时模型及软件包大小占用储存空间较小(比如小于30MB),不会明显增加端侧设备负担。为了不影响体验,离线合成模型应同云侧TTS的音质相近,不会带来明显的用户体验下降。为了将离线TTS的模型轻量化,从而能在终端设备上实时运行,可以采用知识蒸馏的方式对模型进行压缩。
本申请实施例中,所述目标RNN为根据老师RNN对学生RNN进行知识蒸馏得到的。
首先可以进行全量模型训练,也就是训练得到数据处理精度较高的老师TTS(包括老师RNN以及老师特征提取网络),之后可以基于老师TTS对学生TTS(包括学生RNN以及学生特征提取网络)执行知识蒸馏训练,以得到本申请实施例中的压缩后的TTS(包括目标RNN以及目标特征提取网络)。
在进行知识蒸馏的过程中,训练损失的构建具体包括但不限于以下三种方式:
在一种可能的实现中,所述目标RNN为根据老师RNN以及第一目标损失,通过对学生RNN进行知识蒸馏得到的;所述第一目标损失指示第一输出和第二输出之间的差异;其中,所述第一输出为所述老师RNN的输出层的输出,所述第二输出为所述学生RNN的输出层的输出。
本申请实施例中,可以基于RNN的输出层输出的语音数据(例如Mel谱或BARK谱)来构建损失,例如可以采用基于Mel谱图的蒸馏方式(mel-spectrogram distillation,MSD),以此强制学生TTS模型能够学习到老师TTS的最终输出。
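示例性的,基于Mel谱图的蒸馏方式MSD的损失可以按如下方式构建(以下PyTorch代码以均方误差作为差异度量,具体的度量方式为假设,仅为示意):

```python
import torch
import torch.nn.functional as F

def msd_loss(student_mel, teacher_mel):
    # 使学生TTS输出层的Mel谱逼近老师TTS输出层的Mel谱
    return F.mse_loss(student_mel, teacher_mel)

student_mel = torch.randn(1, 200, 80, requires_grad=True)   # 学生模型输出(假设)
teacher_mel = torch.randn(1, 200, 80)                        # 老师模型输出(假设,不参与梯度更新)
loss = msd_loss(student_mel, teacher_mel)
loss.backward()   # 仅对学生模型的参数求梯度并更新
print(loss.item())
```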
在一种可能的实现中,所述目标RNN为根据老师RNN以及第一目标损失,通过对学生RNN进行知识蒸馏得到的;所述第一目标损失指示第一输出和第二输出之间的差异;其中,所述第一输出为所述老师RNN的中间层的输出,所述第二输出为所述学生RNN的中间层的输出。
在一种可能的实现中,可以通过目标特征提取网络对所述第一音素和所述第二音素进行处理,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征;所述目标特征提取网络为根据老师特征提取网络以及第二目标损失,通过对学生特征提取网络进行知识蒸馏得到的;所述第二目标损失与第三输出和第四输出之间的差异有关;其中,所述第三输出为所述老师特征提取网络的中间层的输出,所述第四输出为所述学生特征提取网络的中间层的输出。
示例性的,本申请实施例中,可以采用基于中间特征表示的蒸馏方式(intermediate representation distillation,IRD),知识蒸馏所采用的损失可以表示为$L_{IRD}=\sum_{i}\left\|K_{S}^{i}W_{i}-K_{T}^{i}\right\|_{2}^{2}$,其中$K_S$和$K_T$分别是学生TTS模型和老师TTS模型的中间层(可以是RNN的中间层或者是特征提取网络的中间层)的输出,而$W_i$则为学生TTS模型中第i层需学习的参数,以此来强制学生TTS模型的各层输出同老师TTS模型的输出结果保持足够相近。
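示例性的,下面给出一段IRD损失的PyTorch示意代码:用可学习的矩阵W_i将学生中间层输出变换后与老师中间层输出计算差异并逐层求和(其中范数的具体形式与求和方式为假设,仅为草图):

```python
import torch
import torch.nn as nn

class IRDLoss(nn.Module):
    def __init__(self, student_dims, teacher_dims):
        super().__init__()
        # W_i:学生TTS模型中第i层需学习的对齐参数(此处用无偏置的线性层实现)
        self.proj = nn.ModuleList([nn.Linear(s, t, bias=False)
                                   for s, t in zip(student_dims, teacher_dims)])

    def forward(self, student_feats, teacher_feats):
        # 对每一层,强制学生中间层输出(经W_i变换后)与老师中间层输出保持足够相近
        return sum(((w(ks) - kt) ** 2).mean()
                   for w, ks, kt in zip(self.proj, student_feats, teacher_feats))

ird = IRDLoss(student_dims=[256, 256], teacher_dims=[512, 512])
ks = [torch.randn(1, 10, 256) for _ in range(2)]   # 学生模型两个中间层的输出(假设)
kt = [torch.randn(1, 10, 512) for _ in range(2)]   # 老师模型对应层的输出(假设)
print(ird(ks, kt).item())
```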
在一种可能的实现中,可以通过目标特征提取网络对所述第一音素和所述第二音素进行处理,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征;所述目标特征提取网络为根据老师特征提取网络以及第二目标损失,通过对学生特征提取网络进行知识蒸馏得到的;所述第二目标损失指示第三输出和第四输出之间的差异;其中,所述第三输出为所述老师特征提取网络的输出层的输出,所述第四输出为所述学生特征提取网络的输出层的输出。
本申请实施例中,可以采用基于韵律的蒸馏方式(prosody distillation,PD)来强制学生TTS模型能够学习到老师TTS模型的韵律预测结果。知识蒸馏所采用的损失函数可以由学生TTS模型和老师TTS模型预测的时长、音高及能量之间的二阶范数,以及两个模型的音高和能量预测模块最后一层卷积层权重之间的差异构成,其中$W_f$和$W_{\theta}$表示可训练的矩阵,用于对齐参数维度。
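示例性的,下面给出一段PD损失的PyTorch示意代码:分别计算学生与老师预测的时长、音高、能量之间的二阶范数,并对音高/能量预测模块最后一层卷积层权重经可训练矩阵对齐后计算差异(各项的组合方式与维度均为假设,仅为示意):

```python
import torch

def pd_loss(stu, tea, W_f, W_theta):
    # 学生与老师预测的时长、音高、能量之间的二阶范数
    loss = sum(torch.norm(stu[k] - tea[k], p=2) for k in ("duration", "pitch", "energy"))
    # 音高/能量预测模块最后一层卷积层权重的对齐项,W_f与W_theta为可训练的对齐矩阵
    loss = loss + torch.norm(W_f @ stu["pitch_conv_w"] - tea["pitch_conv_w"], p=2)
    loss = loss + torch.norm(W_theta @ stu["energy_conv_w"] - tea["energy_conv_w"], p=2)
    return loss

stu = {"duration": torch.randn(10), "pitch": torch.randn(10), "energy": torch.randn(10),
       "pitch_conv_w": torch.randn(256, 9), "energy_conv_w": torch.randn(256, 9)}
tea = {"duration": torch.randn(10), "pitch": torch.randn(10), "energy": torch.randn(10),
       "pitch_conv_w": torch.randn(512, 9), "energy_conv_w": torch.randn(512, 9)}
W_f = torch.randn(512, 256, requires_grad=True)       # 用于对齐参数维度的可训练矩阵
W_theta = torch.randn(512, 256, requires_grad=True)
print(pd_loss(stu, tea, W_f, W_theta).item())
```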
本申请实施例提出一种模型蒸馏方法,先训练一个全量的老师TTS模型;继而设计一个较小模型大小的学生TTS模型,可以采用多种蒸馏方式来训练学生TTS模型,包括但不限于基于Mel谱图的蒸馏方式MSD、基于中间特征表示的蒸馏方式IRD及基于韵律的蒸馏方式PD等。
接下来从软件模块的角度来描述本申请实施例提供的文本数据处理方法。
参照图7,本申请实施例可以包括文本获取处理模块,可用于获取待处理的目标文本,并对目标文本进行预处理,预处理可以包括文本分析,文本分析可以是句法分析以得到文本特征,文本特征可以包括但不限于:音子序列、词性、词长以及韵律停顿。具体可以参照上述实施例中步骤401的描述,这里不再赘述。
编码模块,可用于对处理后的文本数据进行编码,得到特征向量表示。韵律预测模块,可用于预测时长、音高及能量,其中,韵律预测模块可以包括时长预测模块、音高预测模块以及能量预测模块,时长预测模块可用于根据编码输出的特征向量做出时长预测,音高预测模块可用于根据编码输出的特征向量做出音高预测,能量预测模块可用于根据编码输出的特征向量做出能量预测。编码模块可输出音频特征。具体可以参照上述实施例中步骤402的描述,这里不再赘述。
自回归模块可将时长模块、音高模块、能量模块三个输出的叠加结果,通过自回归的方式转换为对应的谱图特征(上述实施例中称之为语音数据),具体可以参照上述实施例中步骤403的描述,这里不再赘述。声码器模块可将自回归模块的输出转为声音波形(上述实施例中称之为音频)。具体可以参照上述实施例中步骤404的描述,这里不再赘述。
更具体的,可以参照图8,其中,编码模块可以对输入的目标文本的音素序列(phoneme sequence X)进行编码,以得到隐藏层表示(Hidden representations H),之后对隐藏层表示进行韵律预测(Prosody injector),其中,韵律预测包括时长预测模块、音高预测模块以及能量预测模块,韵律预测的输出为音频特征(Sum G),其中,音频特征包括各个帧的音频特征(g1、…、gn、…、gN),之后自回归模块(例如图8中示出的串行连接的LUP层(其维度为512)、3层512个卷积核kernel大小为5的filter及1层隐含层为512的双向循环神经网络层)可以对音频特征进行处理以得到语音数据(Y1、…、Yn、…、YN),在对语音数据进行补偿之后,可以得到处理后的语音数据(例如图8中示出的MEL谱)。
接下来从模型训练和推理的角度描述本申请实施例的应用架构。
参照图9,在模型训练阶段(图9中描述为声学模型训练),可以基于TTS训练语料(TTS training corpus)训练得到老师TTS模型(例如图9中的teacher SAR声学模型),之后对老师TTS模型进行知识蒸馏得到目标TTS模型(例如图9中的small SAR声学模型),基于训练得到的目标TTS模型,可以进行在线的语音合成,具体的,可以获取输入文本(也就是上述实施例中的目标文本),对获取的目标文本进行前端处理得到文本特征,并基于目标TTS模型对文本特征进行处理(图9中描述为声学特征解码)以得到语音数据(图9中描述为声学特征),并基于声学特征进行音频的合成。
接下来从装置的角度对本申请实施例提供的文本处理装置进行描述,参照图10,图10为本申请实施例提供的一种文本处理装置1000的示意,如图10中示出的那样,本申请实施例提供的一种文本处理装置1000,包括:
获取模块1001,用于获取目标文本,所述目标文本的音素包括相邻的第一音素和第二音素;
关于获取模块1001的具体描述可以参照步骤401的描述,这里不再赘述。
特征提取模块1002,用于对所述第一音素和所述第二音素进行特征提取,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征;
关于特征提取模块1002的具体描述可以参照步骤402的描述,这里不再赘述。
语音数据提取模块1003,用于通过目标循环神经网络RNN根据所述第一音频特征获取所述第一音素对应的第一语音数据,通过所述目标RNN根据所述第二音频特征获取所述第二音素对应的第二语音数据;其中,所述获取所述第一音素对应的第一语音数据和所述获取所述第二音素对应的第二语音数据的步骤并行执行;
关于语音数据提取模块1003的具体描述可以参照步骤403的描述,这里不再赘述。
音频提取模块1004,用于根据所述第一语音数据和所述第二语音数据,通过声码器获取所述第一音素和所述第二音素对应的音频。
关于音频提取模块1004的具体描述可以参照步骤404的描述,这里不再赘述。
在一种可能的实现中,所述目标RNN包括隐含层和输出层,所述语音数据提取模块,用于通过所述隐含层根据所述第一音频特征确定第一隐含层输出;
通过所述输出层根据所述第一隐含层输出确定所述第一语音数据;
通过所述隐含层根据所述第二音频特征确定第二隐含层输出;
通过所述输出层根据所述第二隐含层输出确定所述第二语音数据,其中,所述隐含层确定第二隐含层输出的过程中,所述第一隐含层输出不作为所述隐含层的输入。
在一种可能的实现中,所述第一音素的时长为N帧,所述第一音频特征的数量为N,且N个第一音频特征中的每个音频特征对应于所述N帧中的一帧,所述N个第一音频特征包括第一目标音频特征和第三目标音频特征,所述第一目标音频特征对应的帧为所述第三目标音频特征对应的帧之前相邻的帧;所述第一语音数据包括所述第一目标音频特征对应的第一子语音数据以及所述第三目标音频特征对应的第三子语音数据;
所述语音数据提取模块,用于
通过所述隐含层根据所述第三目标音频特征确定第三子隐含层输出;
通过所述隐含层根据所述第一目标音频特征和所述第三子隐含层输出确定第一子隐含层输出;
所述输出层根据所述第三子隐含层输出确定所述第三子语音数据,
所述输出层根据所述第一子隐含层输出确定所述第一子语音数据。
在一种可能的实现中,所述第一音频特征包括如下信息的至少一种:所述第一音素的基频信息或能量信息,所述第二音频特征包括如下信息的至少一种:所述第二音素的基频信息或能量信息。
在一种可能的实现中,所述第一语音数据以及所述第二语音数据为梅尔频谱MEL或巴克谱Bark。
在一种可能的实现中,所述目标RNN为根据老师RNN对学生RNN进行知识蒸馏得到的。
在一种可能的实现中,所述目标RNN为根据老师RNN以及第一目标损失,通过对学生RNN进行知识蒸馏得到的;所述第一目标损失指示第一输出和第二输出之间的差异;其中,
所述第一输出为所述老师RNN的输出层的输出,所述第二输出为所述学生RNN的输出层的输出;或,
所述第一输出为所述老师RNN的中间层的输出,所述第二输出为所述学生RNN的中间层的输出。
在一种可能的实现中,所述特征提取模块,用于通过目标特征提取网络对所述第一音素和所述第二音素进行处理,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征;所述目标特征提取网络为根据老师特征提取网络以及第二目标损失,通过对学生特征提取网络进行知识蒸馏得到的;所述第二目标损失指示第三输出和第四输出之间的差异;其中,
所述第三输出为所述老师特征提取网络的输出层的输出,所述第四输出为所述学生特征提取网络的输出层的输出;或,
所述第三输出为所述老师特征提取网络的中间层的输出,所述第四输出为所述学生特征提取网络的中间层的输出。
接下来介绍本申请实施例提供的一种执行设备,请参阅图11,图11为本申请实施例提供的执行设备的一种结构示意图,执行设备1100具体可以表现为手机、平板、笔记本电脑、智能穿戴设备、服务器等,此处不做限定。其中,执行设备1100上可以部署有图10对应实施例中所描述的文本处理装置,用于实现图10对应实施例中文本数据处理的功能。具体的,执行设备1100包括:接收器1101、发射器1102、处理器1103和存储器1104(其中执行设备1100中的处理器1103的数量可以为一个或多个,图11中以一个处理器为例),其中,处理器1103可以包括应用处理器11031和通信处理器11032。在本申请的一些实施例中,接收器1101、发射器1102、处理器1103和存储器1104可通过总线或其它方式连接。
存储器1104可以包括只读存储器和随机存取存储器,并向处理器1103提供指令和数据。存储器1104的一部分还可以包括非易失性随机存取存储器(non-volatile random access memory,NVRAM)。存储器1104存储有处理器和操作指令、可执行模块或者数据结构,或者它们的子集,或者它们的扩展集,其中,操作指令可包括各种操作指令,用于实现各种操作。
处理器1103控制执行设备的操作。具体的应用中,执行设备的各个组件通过总线系统耦合在一起,其中总线系统除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都称为总线系统。
上述本申请实施例揭示的方法可以应用于处理器1103中,或者由处理器1103实现。处理器1103可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器1103中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1103可以是通用处理器、数字信号处理器(digital signal processing,DSP)、微处理器或微控制器,还可进一步包括专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。该处理器1103可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1104,处理器1103读取存储器1104中的信息,结合其硬件完成上述方法的步骤。
接收器1101可用于接收输入的数字或字符信息,以及产生与执行设备的相关设置以及功能控制有关的信号输入。发射器1102可用于通过第一接口输出数字或字符信息;发射器1102还可用于通过第一接口向磁盘组发送指令,以修改磁盘组中的数据;发射器1102还可以包括显示屏等显示设备。
本申请实施例中,在一种情况下,处理器1103,用于执行图4对应实施例中的文本数据处理方法。
本申请实施例还提供了一种训练设备,请参阅图12,图12是本申请实施例提供的训练设备一种结构示意图,具体的,训练设备1200由一个或多个服务器实现,训练设备1200可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)1212(例如,一个或一个以上处理器)和存储器1232,一个或一个以上存储应用程序1242或数据1244的存储介质1230(例如一个或一个以上海量存储设备)。其中,存储器1232和存储介质1230可以是短暂存储或持久存储。存储在存储介质1230的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对训练设备中的一系列指令操作。更进一步地,中央处理器1212可以设置为与存储介质1230通信,在训练设备1200上执行存储介质1230中的一系列指令操作。
训练设备1200还可以包括一个或一个以上电源1226,一个或一个以上有线或无线网络接口1250,一个或一个以上输入输出接口1258;或,一个或一个以上操作系统1241,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。
具体的,训练设备可以执行上述实施例中与模型训练相关的步骤。
本申请实施例中还提供一种计算机程序产品,当其在计算机上运行时,使得计算机执行如前述执行设备所执行的步骤,或者,使得计算机执行如前述训练设备所执行的步骤。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有用于进行信号处理的程序,当其在计算机上运行时,使得计算机执行如前述执行设备所执行的步骤,或者,使得计算机执行如前述训练设备所执行的步骤。
本申请实施例提供的执行设备、训练设备或终端设备具体可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使执行设备内的芯片执行上述实施例描述的数据处理方法,或者,以使训练设备内的芯片执行上述实施例描述的数据处理方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
具体的,请参阅图13,图13为本申请实施例提供的芯片的一种结构示意图,所述芯片可以表现为神经网络处理器NPU 1300,NPU 1300作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路1303,通过控制器1304控制运算电路1303提取存储器中的矩阵数据并进行乘法运算。
在一些实现中,运算电路1303内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路1303是二维脉动阵列。运算电路1303还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路1303是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器1302中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器1301中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)1308中。
统一存储器1306用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(Direct Memory Access Controller,DMAC)1305被搬运到权重存储器1302中。输入数据也通过DMAC被搬运到统一存储器1306中。
BIU为Bus Interface Unit,即总线接口单元1310,用于AXI总线与DMAC和取指存储器(Instruction Fetch Buffer,IFB)1309的交互。
总线接口单元1310(Bus Interface Unit,简称BIU),用于取指存储器1309从外部存储器获取指令,还用于存储单元访问控制器1305从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器1306或将权重数据搬运到权重存储器1302中或将输入数据搬运到输入存储器1301中。
向量计算单元1307包括多个运算处理单元,在需要的情况下,对运算电路1303的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/全连接层网络计算,如Batch Normalization(批归一化),像素级求和,对特征平面进行上采样等。
在一些实现中,向量计算单元1307能将经处理的输出的向量存储到统一存储器1306。例如,向量计算单元1307可以将线性函数或非线性函数应用到运算电路1303的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元1307生成归一化的值、像素级求和的值,或二者均有。在一些实现中,处理过的输出的向量能够用作运算电路1303的激活输入,例如用于在神经网络中的后续层中的使用。
控制器1304连接的取指存储器(instruction fetch buffer)1309,用于存储控制器1304使用的指令;
统一存储器1306,输入存储器1301,权重存储器1302以及取指存储器1309均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述程序执行的集成电路。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,训练设备,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、训练设备或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、训练设备或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的训练设备、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。
Claims (19)
- 一种文本数据处理方法,其特征在于,包括:获取目标文本,所述目标文本的音素包括相邻的第一音素和第二音素;对所述第一音素和所述第二音素进行特征提取,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征;通过目标循环神经网络RNN根据所述第一音频特征获取所述第一音素对应的第一语音数据,通过所述目标RNN根据所述第二音频特征获取所述第二音素对应的第二语音数据;其中,所述获取所述第一音素对应的第一语音数据和所述获取所述第二音素对应的第二语音数据的步骤并行执行;根据所述第一语音数据和所述第二语音数据,通过声码器获取所述第一音素和所述第二音素对应的音频。
- 根据权利要求1所述的方法,其特征在于,所述目标RNN包括隐含层和输出层,所述通过目标循环神经网络RNN根据所述第一音频特征获取所述第一音素对应的第一语音数据,通过所述目标RNN根据所述第二音频特征获取所述第二音素对应的第二语音数据,包括:通过所述隐含层根据所述第一音频特征确定第一隐含层输出;通过所述输出层根据所述第一隐含层输出确定所述第一语音数据;通过所述隐含层根据所述第二音频特征确定第二隐含层输出;通过所述输出层根据所述第二隐含层输出确定所述第二语音数据,其中,所述隐含层确定第二隐含层输出的过程中,所述第一隐含层输出不作为所述隐含层的输入。
- 根据权利要求1或2所述的方法,其特征在于,所述第一音素的时长为N帧,所述第一音频特征的数量为N,且N个第一音频特征中的每个音频特征对应于所述N帧中的一帧,所述N个第一音频特征包括第一目标音频特征和第三目标音频特征,所述第一目标音频特征对应的帧为所述第三目标音频特征对应的帧之前的帧;所述第一语音数据包括所述第一目标音频特征对应的第一子语音数据以及所述第三目标音频特征对应的第三子语音数据;所述通过所述隐含层根据所述第一音频特征确定第一隐含层输出包括:通过所述隐含层根据所述第三目标音频特征确定第三子隐含层输出;通过所述隐含层根据所述第一目标音频特征和所述第三子隐含层输出确定第一子隐含层输出;所述通过所述输出层根据所述第一隐含层输出确定所述第一语音数据包括:通过所述输出层根据所述第三子隐含层输出确定所述第三子语音数据,通过所述输出层根据所述第一子隐含层输出确定所述第一子语音数据。
- 根据权利要求1至3任一所述的方法,其特征在于,所述第一音频特征包括如下信息的至少一种:所述第一音素的基频信息或能量信息,所述第二音频特征包括如下信息的至少一种:所述第二音素的基频信息或能量信息。
- 根据权利要求1至4任一所述的方法,其特征在于,所述第一语音数据以及所述第二语音数据为梅尔频谱MEL或巴克谱Bark。
- 根据权利要求1至5任一所述的方法,其特征在于,所述目标RNN为根据老师RNN对学生RNN进行知识蒸馏得到的。
- 根据权利要求6所述的方法,其特征在于,所述目标RNN为根据老师RNN以及第一目标损失,通过对学生RNN进行知识蒸馏得到的;所述第一目标损失指示第一输出和第二输出之间的差异;其中,所述第一输出为所述老师RNN的输出层的输出,所述第二输出为所述学生RNN的输出层的输出;或,所述第一输出为所述老师RNN的中间层的输出,所述第二输出为所述学生RNN的中间层的输出。
- 根据权利要求1至7任一所述的方法,其特征在于,所述对所述第一音素和所述第二音素进行特征提取,包括:通过目标特征提取网络对所述第一音素和所述第二音素进行处理,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征;所述目标特征提取网络为根据老师特征提取网络以及第二目标损失,通过对学生特征提取网络进行知识蒸馏得到的;所述第二目标损失指示第三输出和第四输出之间的差异;其中,所述第三输出为所述老师特征提取网络的输出层的输出,所述第四输出为所述学生特征提取网络的输出层的输出;或,所述第三输出为所述老师特征提取网络的中间层的输出,所述第四输出为所述学生特征提取网络的中间层的输出。
- 一种文本数据处理装置,其特征在于,包括:获取模块,用于获取目标文本,所述目标文本的音素包括相邻的第一音素和第二音素;特征提取模块,用于对所述第一音素和所述第二音素进行特征提取,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征;语音数据提取模块,用于通过目标循环神经网络RNN根据所述第一音频特征获取所述第一音素对应的第一语音数据,通过所述目标RNN根据所述第二音频特征获取所述第二音素对应的第二语音数据;其中,所述获取所述第一音素对应的第一语音数据和所述获取所述第二音素对应的第二语音数据的步骤并行执行;音频提取模块,用于根据所述第一语音数据和所述第二语音数据,通过声码器获取所述第一音素和所述第二音素对应的音频。
- 根据权利要求9所述的装置,其特征在于,所述目标RNN包括隐含层和输出层,所述语音数据提取模块,用于通过所述隐含层根据所述第一音频特征确定第一隐含层输出;通过所述输出层根据所述第一隐含层输出确定所述第一语音数据;通过所述隐含层根据所述第二音频特征确定第二隐含层输出;通过所述输出层根据所述第二隐含层输出确定所述第二语音数据,其中,所述隐含层确定第二隐含层输出的过程中,所述第一隐含层输出不作为所述隐含层的输入。
- 根据权利要求9或10所述的装置,其特征在于,所述第一音素的时长为N帧,所述第一音频特征的数量为N,且N个第一音频特征中的每个音频特征对应于所述N帧中的一帧,所述N个第一音频特征包括第一目标音频特征和第三目标音频特征,所述第一目标音频特征对应的帧为所述第三目标音频特征对应的帧之前相邻的帧;所述第一语音数据包括所述第一目标音频特征对应的第一子语音数据以及所述第三目标音频特征对应的第三子语音数据;所述语音数据提取模块,用于:通过所述隐含层根据所述第三目标音频特征确定第三子隐含层输出;通过所述隐含层根据所述第一目标音频特征和所述第三子隐含层输出确定第一子隐含层输出;通过所述输出层根据所述第三子隐含层输出确定所述第三子语音数据,通过所述输出层根据所述第一子隐含层输出确定所述第一子语音数据。
- 根据权利要求9至11任一所述的装置,其特征在于,所述第一音频特征包括如下信息的至少一种:所述第一音素的基频信息或能量信息,所述第二音频特征包括如下信息的至少一种:所述第二音素的基频信息或能量信息。
- 根据权利要求9至12任一所述的装置,其特征在于,所述第一语音数据以及所述第二语音数据为梅尔频谱MEL或巴克谱Bark。
- 根据权利要求9至13任一所述的装置,其特征在于,所述目标RNN为根据老师RNN对学生RNN进行知识蒸馏得到的。
- 根据权利要求14所述的装置,其特征在于,所述目标RNN为根据老师RNN以及第一目标损失,通过对学生RNN进行知识蒸馏得到的;所述第一目标损失指示第一输出和第二输出之间的差异;其中,所述第一输出为所述老师RNN的输出层的输出,所述第二输出为所述学生RNN的输出层的输出;或,所述第一输出为所述老师RNN的中间层的输出,所述第二输出为所述学生RNN的中间层的输出。
- 根据权利要求9至15任一所述的装置,其特征在于,所述特征提取模块,用于通过目标特征提取网络对所述第一音素和所述第二音素进行处理,以获取所述第一音素的第一音频特征、以及所述第二音素的第二音频特征;所述目标特征提取网络为根据老师特征提取网络以及第二目标损失,通过对学生特征提取网络进行知识蒸馏得到的;所述第二目标损失指示第三输出和第四输出之间的差异;其中,所述第三输出为所述老师特征提取网络的输出层的输出,所述第四输出为所述学生特征提取网络的输出层的输出;或,所述第三输出为所述老师特征提取网络的中间层的输出,所述第四输出为所述学生特征提取网络的中间层的输出。
- 一种文本数据处理装置,其特征在于,所述装置包括存储器和处理器;所述存储器存储有代码,所述处理器被配置为执行所述代码,当所述代码被执行时,所述文本数据处理装置执行如权利要求1至8任一所述的方法。
- 一种计算机存储介质,其特征在于,所述计算机存储介质存储有一个或多个指令,所述指令在由一个或多个计算机执行时使得所述一个或多个计算机实施权利要求1至8任一所述的方法。
- 一种计算机程序产品,其特征在于,所述计算机程序产品包括代码,当所述代码被执行时,用于实现权利要求1至8任一项所述的方法的步骤。
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP22742133.6A EP4270382A4 (en) | 2021-01-22 | 2022-01-18 | METHOD AND APPARATUS FOR PROCESSING TEXT DATA |
US18/356,738 US20230360634A1 (en) | 2021-01-22 | 2023-07-21 | Text data processing method and apparatus |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110091046.9A CN112885328B (zh) | 2021-01-22 | 2021-01-22 | 一种文本数据处理方法及装置 |
CN202110091046.9 | 2021-01-22 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/356,738 Continuation US20230360634A1 (en) | 2021-01-22 | 2023-07-21 | Text data processing method and apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022156654A1 true WO2022156654A1 (zh) | 2022-07-28 |
Family
ID=76050482
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/072441 WO2022156654A1 (zh) | 2021-01-22 | 2022-01-18 | 一种文本数据处理方法及装置 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20230360634A1 (zh) |
EP (1) | EP4270382A4 (zh) |
CN (1) | CN112885328B (zh) |
WO (1) | WO2022156654A1 (zh) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112885328B (zh) * | 2021-01-22 | 2024-06-28 | 华为技术有限公司 | 一种文本数据处理方法及装置 |
CN113393832B (zh) * | 2021-06-03 | 2023-10-10 | 清华大学深圳国际研究生院 | 一种基于全局情感编码的虚拟人动画合成方法及系统 |
CN113421547B (zh) * | 2021-06-03 | 2023-03-17 | 华为技术有限公司 | 一种语音处理方法及相关设备 |
CN113516968B (zh) * | 2021-06-07 | 2022-05-20 | 北京邮电大学 | 一种端到端长时语音识别方法 |
CN113380222B (zh) * | 2021-06-09 | 2024-06-04 | 广州虎牙科技有限公司 | 语音合成方法、装置、电子设备及存储介质 |
WO2023184874A1 (zh) * | 2022-03-31 | 2023-10-05 | 美的集团(上海)有限公司 | 语音合成方法和装置 |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106910497B (zh) * | 2015-12-22 | 2021-04-16 | 阿里巴巴集团控股有限公司 | 一种中文词语发音预测方法及装置 |
US11069335B2 (en) * | 2016-10-04 | 2021-07-20 | Cerence Operating Company | Speech synthesis using one or more recurrent neural networks |
CN110751260B (zh) * | 2018-07-24 | 2024-08-20 | 北京三星通信技术研究有限公司 | 电子设备、任务处理的方法以及训练神经网络的方法 |
CN111402857B (zh) * | 2020-05-09 | 2023-11-21 | 广州虎牙科技有限公司 | 语音合成模型训练方法和装置、电子设备及存储介质 |
- 2021-01-22: CN, application CN202110091046.9A, publication CN112885328B (zh), status Active
- 2022-01-18: EP, application EP22742133.6A, publication EP4270382A4 (en), status Pending
- 2022-01-18: WO, application PCT/CN2022/072441, publication WO2022156654A1 (zh), status unknown
- 2023-07-21: US, application US18/356,738, publication US20230360634A1 (en), status Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020027619A1 (ko) * | 2018-08-02 | 2020-02-06 | 네오사피엔스 주식회사 | 순차적 운율 특징을 기초로 기계학습을 이용한 텍스트-음성 합성 방법, 장치 및 컴퓨터 판독가능한 저장매체 |
CN109754778A (zh) * | 2019-01-17 | 2019-05-14 | 平安科技(深圳)有限公司 | 文本的语音合成方法、装置和计算机设备 |
WO2020190050A1 (ko) * | 2019-03-19 | 2020-09-24 | 휴멜로 주식회사 | 음성 합성 장치 및 그 방법 |
CN111583904A (zh) * | 2020-05-13 | 2020-08-25 | 北京字节跳动网络技术有限公司 | 语音合成方法、装置、存储介质及电子设备 |
CN112002305A (zh) * | 2020-07-29 | 2020-11-27 | 北京大米科技有限公司 | 语音合成方法、装置、存储介质及电子设备 |
CN112233646A (zh) * | 2020-10-20 | 2021-01-15 | 携程计算机技术(上海)有限公司 | 基于神经网络的语音克隆方法、系统、设备及存储介质 |
CN112885328A (zh) * | 2021-01-22 | 2021-06-01 | 华为技术有限公司 | 一种文本数据处理方法及装置 |
Non-Patent Citations (1)
Title |
---|
See also references of EP4270382A4 |
Also Published As
Publication number | Publication date |
---|---|
CN112885328B (zh) | 2024-06-28 |
US20230360634A1 (en) | 2023-11-09 |
EP4270382A4 (en) | 2024-05-15 |
EP4270382A1 (en) | 2023-11-01 |
CN112885328A (zh) | 2021-06-01 |