CN111063335B - End-to-end tone recognition method based on neural network

End-to-end tone recognition method based on neural network

Info

Publication number
CN111063335B
CN111063335B (application CN201911310349.4A)
Authority
CN
China
Prior art keywords
tone
syllable
network
neural network
label
Prior art date
Legal status
Active
Application number
CN201911310349.4A
Other languages
Chinese (zh)
Other versions
CN111063335A (en)
Inventor
Huang Hao (黄浩)
Wang Kai (王凯)
Hu Ying (胡英)
Current Assignee
Xinjiang University
Original Assignee
Xinjiang University
Priority date
Filing date
Publication date
Application filed by Xinjiang University
Priority to CN201911310349.4A
Publication of CN111063335A
Application granted
Publication of CN111063335B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/005 Language recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an end-to-end tone recognition method based on a neural network, comprising the following steps: construct an end-to-end tone recognition model and determine the required hyper-parameters, such as the number of neural network layers and the number of hidden-layer nodes; train a speech recognition acoustic model on a training set and obtain the start and end times of each syllable by forced alignment; feed the selected training speech data and the tone label of each syllable into the end-to-end tone recognition model for training and optimization to obtain optimized neural network model parameters; adjust the neural network model parameters continuously and select the optimal network model parameters; obtain a test utterance and, when the sentence content text is given, obtain the start and end times of each syllable by forced alignment; when it is not given, obtain the start and end times of each syllable by automatic speech recognition; feed the selected test speech data and the syllable time marks into the end-to-end tone recognition model for recognition, finally obtaining the tone type of each syllable in each test utterance.

Description

End-to-end tone recognition method based on neural network
Technical Field
The invention relates to the field of Chinese Mandarin tone recognition, in particular to an end-to-end tone recognition method based on a neural network.
Background
With the rapid development of artificial intelligence technology, research on speech technology continues to deepen, covering speech recognition, speech synthesis, speech separation, voice conversion, speaker recognition, and related fields. Experiments show that for tonal languages the results in these fields are strongly influenced by the tones of the speech. In Mandarin Chinese, tones are divided into five types, namely the yin-ping, yang-ping, shang, and qu tones plus the neutral tone; in other words, the first tone is level (ˉ), the second rising (ˊ), the third falling-rising (ˇ), the fourth falling (ˋ), and there is additionally the neutral tone. Tone is a very important part of Mandarin: a wrong tone can create ambiguity and lead to misunderstanding of the speech, so research on Mandarin tone is necessary. Tone recognition is an important research direction in the speech field; its main purpose is to obtain the tones of tonal-language speech accurately, improving the accuracy of tasks such as speech recognition and speech synthesis. Traditional tone recognition uses classical classification algorithms, i.e., feature extraction at the front end and a classifier at the back end; it comprises two stages, obtaining fundamental frequency features and then classifying the tones independently.
Fundamental frequency features can be extracted with time-domain methods, frequency-domain methods, or hybrid methods. Time-domain methods include the autocorrelation method and the average magnitude difference method; frequency-domain methods include the cepstrum method. These are hand-designed heuristic fundamental frequency extraction algorithms whose settings rely on experience from experimental phonetics. For the back-end tone classification model, classifiers from conventional pattern recognition are mainly used, such as the support vector machine, the Gaussian mixture model, the decision tree, the Gaussian mixture model-hidden Markov model, the conditional random field, or the neural network.
Tone classification models fall into two categories: models based on frame features and models based on segment features. Frame-feature models include the Gaussian mixture model-hidden Markov model and the conditional random field; segment-feature models include the support vector machine, the Gaussian mixture model, the decision tree, and so on. A frame-feature model can process the extracted fundamental-frequency-related features directly, taking a variable-length sequence as input and computing the posterior probability of each tone given the input sequence. A segment-feature model can only process input feature vectors of fixed dimension, so the fundamental-frequency-related feature sequence must be extracted first and then converted by hand-designed rules into a fixed-dimension observation vector containing the tone information; this vector is fed to the segment-feature classifier to train the tone model, and finally the tone data to be tested are classified by the tone model to obtain the recognition result, completing the whole tone recognition process. A sketch of one such fixed-dimension conversion is given after this paragraph.
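For illustration, the following is a minimal Python sketch of one possible hand-designed conversion from a variable-length syllable F0 contour to a fixed-dimension vector; the recipe (resample to 10 points on a log scale, z-normalize, append log duration) is an assumption of the example, since the patent does not fix a specific rule.

```python
import numpy as np

def syllable_f0_to_fixed_vector(f0: np.ndarray, n_points: int = 10) -> np.ndarray:
    """Reduce a variable-length syllable F0 contour (Hz, 0 = unvoiced) to a
    fixed-dimension vector: resample to n_points on a log scale, z-normalize,
    and append the log duration in frames."""
    voiced = f0[f0 > 0]
    if voiced.size < 2:                       # degenerate syllable: no usable pitch
        voiced = np.array([1.0, 1.0])
    xs = np.linspace(0, voiced.size - 1, n_points)
    contour = np.interp(xs, np.arange(voiced.size), np.log(voiced))
    contour = (contour - contour.mean()) / (contour.std() + 1e-8)
    return np.concatenate([contour, [np.log(len(f0))]])   # shape (n_points + 1,)
```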
The existing tone classification technology has the following two main problems:
On the one hand, the traditional fundamental frequency extraction methods are imperfect, and the extracted fundamental frequency values are not accurate enough; this inaccuracy makes the subsequent tone classification results correspondingly less accurate.
On the other hand, when tones are classified using only the fundamental frequency values produced by a traditional extraction method, the hand-designed fundamental frequency features cannot contain all the information helpful for tone classification, so the final classification result is not necessarily optimal. Moreover, traditional tone recognition comprises two separately designed stages, feature extraction and classifier training, each with a large number of parameters, and such two-stage tone recognition cannot always reach the globally optimal result.
Disclosure of Invention
The invention provides an end-to-end tone recognition method based on a neural network, which merges the traditional two-part framework of first extracting fundamental frequency features and then classifying tones into a single jointly learned model, performing end-to-end tone recognition and achieving accurate tone classification, as described in detail below:
a neural network-based end-to-end tonal recognition method, the method comprising:
firstly, training a tone recognition system model:
constructing an end-to-end tone recognition model, and determining the required hyper-parameters, such as the number of neural network layers and the number of hidden-layer nodes;
training a speech recognition acoustic model on a training set, and obtaining the start and end times of each syllable by forced alignment;
sending the selected training voice data and the tone label of each syllable into an end-to-end tone recognition model for training optimization to obtain optimized neural network model parameters;
continuously adjusting parameters of the neural network model, and selecting optimal parameters of the network model;
secondly, tone recognition:
obtaining a test utterance and, when the sentence content text is given, obtaining the start and end times of each syllable by forced alignment; when it is not given, obtaining the start and end times of each syllable by automatic speech recognition;
and sending the selected test speech data into the end-to-end tone recognition model for recognition, finally obtaining the tone type of each syllable in each test utterance.
The method constructs a trainable deep neural network model and combines a fundamental frequency extraction neural network with a tone decoding neural network to form an end-to-end neural network tone classification model; the parameters of both parts are trained and optimized simultaneously in the training stage.
The fundamental frequency extraction neural network is an encoder-decoder based on a recurrent neural network, divided into a fundamental frequency encoder network and a fundamental frequency decoder network.
Further, the fundamental frequency encoder network encodes the speech with a recurrent neural network; the fundamental frequency decoder network predicts a fundamental frequency label starting from the last frame of the utterance, converts each predicted label into a trainable fundamental frequency embedding vector, and uses it together with the encoder state at the preceding time to predict the fundamental frequency label of the preceding frame, proceeding backwards until the label of the first frame has been predicted;
after the fundamental frequency label of each frame has been predicted, the labels are converted into the fundamental frequency value sequence of the whole utterance using the predefined correspondence between labels and frequencies.
The tone decoding neural network is divided into two parts: a tone representation network and a label-dependent tone classification network;
the tone representation network maps the predicted fundamental frequency sequence of each syllable to a vector of fixed dimension;
the tone classification network predicts the tone type of the current syllable from the label predicted for the previous syllable and the fixed-dimension vector of the current syllable.
Further, the tone classification network predicting the tone type of the current syllable from the label predicted for the previous syllable and the fixed-dimension vector of the current syllable specifically comprises:
first, splicing the fixed-dimension representation of the 1st syllable with the fixed-dimension representation corresponding to the beginning of the sentence and sending them into the tone classification network to predict the tone type of the 1st syllable;
converting the predicted tone type of the 1st syllable into the corresponding tone label and sending it, together with the fixed-dimension representation of the 2nd syllable, as joint input into the tone classification network to obtain the tone type of the 2nd syllable;
repeating this until the tone of the last syllable has been predicted.
The technical scheme provided by the invention has the beneficial effects that:
1. Compared with the traditional tone recognition method with two independent stages (fundamental frequency feature extraction and tone classification), the end-to-end joint method reduces the shortcomings of hand-designed algorithms, so better tone classification results can be obtained;
2. The method improves the accuracy of Mandarin tone recognition, breaks the traditional framework that divides tone recognition into two stages (fundamental frequency feature extraction and tone classification), and constructs an end-to-end tone model. The end-to-end model uses two networks for the two stages of traditional tone recognition, bypassing the hand-design step, and the network parameters of the whole model can be jointly optimized, so the accuracy of Mandarin tone recognition is improved; the method is suitable for handling the tone problems of tonal languages;
3. For the feature extraction network, the invention uses an encoder-decoder fundamental frequency extraction network: a recurrent-neural-network encoder maps the whole input sequence to feature vectors, and the fundamental frequency label corresponding to each frame is inferred from these vectors in reverse time order;
4. A label-dependent tone decoding network is adopted for the tone recognition network; when predicting the current tone label, the decoding network uses not only the features from the fundamental frequency extraction network for the current syllable but also the tone label predicted for the previous syllable, so the influence between contextual tone types is taken into account and better tone recognition results can be obtained.
Drawings
FIG. 1 is a general framework diagram of an end-to-end tonal identification network;
FIG. 2 is a diagram of a fundamental frequency feature representation network;
FIG. 3 is a diagram of a tone-embedding representation network;
FIG. 4 is a diagram of the tone prediction network.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
The invention identifies the tone category of each syllable pronunciation from isolated or continuous Mandarin speech. It provides an end-to-end tone recognition method combining a fundamental frequency feature extraction network based on a recurrent-neural-network encoder-decoder with a tone-label-dependent recognition network, merging the traditional fundamental frequency (pitch, F0) feature extraction architecture and the tone classification architecture into one unified network model, thereby achieving end-to-end tone recognition without explicit fundamental frequency feature extraction.
Tone recognition identifies the tones in speech and obtains the tone information they contain, so that this information can serve tasks such as speech recognition, speech synthesis, and voice conversion and help them be realized more accurately. At the same time, for learners of a second tonal language, for example foreigners learning Chinese, tone recognition can help correct errors and improve learning efficiency.
An encoder-decoder network structure is proposed for the fundamental frequency feature extraction network to improve the accuracy of fundamental frequency extraction; a tone-label-dependent decoding network is adopted for the tone recognition network; finally the two networks are combined into one whole network whose parameters are jointly optimized. This makes the tone recognition results more accurate and more efficient, with good effect in both scientific research and daily applications.
Example 1
To solve the above problems, the invention adopts an end-to-end tone recognition method based on a neural network, which converts the traditional two-stage classification problem of first extracting fundamental-frequency-related features and then classifying tones into a single-stage network model whose parameters are jointly optimized, thereby realizing end-to-end tone recognition.
At present, end-to-end methods have become a research hotspot of artificial intelligence technology, for example end-to-end speech recognition, end-to-end speech synthesis, and end-to-end voice conversion; end-to-end techniques reduce the manual setting of experimental hyper-parameters and obtain better performance.
The invention overcomes the defect that the fundamental frequency extraction stage and the tone classification stage of the traditional two-stage tone recognition method are designed separately. The hand-crafted heuristic algorithm of traditional fundamental frequency extraction is replaced by a data-driven trainable deep neural network model, and the fundamental frequency extraction neural network is then combined with the tone decoding neural network to form the final end-to-end neural network tone classification model.
The invention divides the end-to-end tone recognition deep neural network model into two sub-networks: a fundamental frequency extraction network and a tone decoding network. For the fundamental frequency extraction network, an encoder-decoder fundamental frequency extraction model based on a recurrent neural network is provided, divided into a fundamental frequency encoder network and a fundamental frequency decoder network. The fundamental frequency encoder network is a recurrent neural network that maps an input sequence to a hidden-state sequence of the same length. The fundamental frequency decoder network is a feedforward neural network; to predict the fundamental frequency label at the current time, the label at the next time and the encoder hidden state at the current time are used as its joint input.
After the fundamental frequency labels of the whole utterance are obtained, they are converted into a continuous sequence of fundamental frequency values through the mapping between labels and values. The mapping can be a fixed nonlinear mapping between fundamental frequency labels and fundamental frequency values; alternatively, a method called fundamental frequency embedding can be used, in which a trainable fundamental frequency pool converts each predicted label into a real-valued fundamental frequency representation. The output of the fundamental frequency prediction network is then sent to the tone decoder for decoding. A minimal sketch of the fixed label-to-value mapping is given after this paragraph.
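The following Python sketch illustrates one fixed (non-trainable) label-to-value mapping using log-spaced quantization bins. The number of states (64) and the pitch range (60-400 Hz) are assumptions of the example; the patent only requires some predefined correspondence between the N labels and fundamental frequency values.

```python
import numpy as np

N_STATES = 64                    # number of F0 labels N; value assumed for the example
F0_MIN, F0_MAX = 60.0, 400.0     # plausible pitch range in Hz; assumed

# Bin centers spaced uniformly on a log-frequency scale: one fixed,
# non-trainable choice of the label <-> value mapping described above.
centers = np.exp(np.linspace(np.log(F0_MIN), np.log(F0_MAX), N_STATES))

def f0_to_label(f0_hz: float) -> int:
    """Quantize an F0 value (Hz) to the nearest label."""
    return int(np.argmin(np.abs(centers - f0_hz)))

def label_to_f0(label: int) -> float:
    """Map a predicted label back to a real-valued F0 (Hz)."""
    return float(centers[label])
```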
The tone decoder may be a conventional deep feedforward neural network, a recurrent neural network, or a convolutional neural network, among others. The tone decoder predicts the tone of each syllable in an utterance from the output of the fundamental frequency extraction network. The invention provides a tone decoding network based on tone-label dependence, divided into two parts: a tone representation network and a label-dependent tone classification network. The tone representation network maps the variable-length fundamental frequency sequence corresponding to each syllable of the fundamental frequency extraction network to a vector of fixed dimension. The label-dependent tone classification network predicts the current tone type from the fixed-dimension representation of the current syllable and the label of the previous syllable.
The concrete procedure is that the tone classification network predicts the label of each syllable in turn, in syllable order: first, the fixed-dimension representation of the 1st syllable is spliced with the fixed-dimension representation corresponding to BOS (beginning of sentence) and sent into the tone classification network to predict the tone type of the 1st syllable; the predicted tone type of the 1st syllable is converted into the corresponding tone label and sent, together with the fixed-dimension representation of the 2nd syllable, as joint input into the tone classification network to obtain the tone type of the 2nd syllable, and so on until the tone of the last syllable has been predicted.
The invention aims at the tone recognition of the Mandarin Chinese, and improves the capability of automatically carrying out tone classification on each syllable in the voice from the original audio.
The invention obtains the tone information contained in speech, so that this information can serve tasks such as speech recognition, speech synthesis, and voice conversion and help them be realized more accurately. For learners of a second tonal language that is not their native language, for example foreigners learning Chinese, tone recognition will help correct pronunciation errors and improve the effect of second-language learning. The method makes the tone recognition results more accurate than those of traditional methods, with good effect in both scientific research and daily applications.
The technology of the invention overcomes the shortcomings of the traditional two-stage tone classification method: a data-driven neural network replaces the traditional fundamental frequency extraction algorithm, the fundamental frequency extraction network and the back-end tone recognition network are then combined into one unified tone recognition network, and the parameters of the whole network are jointly tuned on the training data, yielding better tone recognition results. An encoder-decoder extraction network is designed for fundamental frequency extraction, so fundamental frequency features are extracted better; a context-dependent tone prediction network is designed for tone decoding, so tone recognition is performed better.
Example 2
The scheme of example 1 is further described below with reference to specific examples, which are described in detail below:
firstly, training a tone recognition system model:
Step 1: selecting a certain amount of Mandarin speech data as the training data (also called samples) of the tone model;
Step 2: constructing and training an acoustic model for automatic speech recognition, and obtaining the start and end times of each syllable by the forced alignment method given the sentence content text;
Step 3: constructing an end-to-end tone recognition model, and determining the required hyper-parameters, such as the number of neural network layers and the number of hidden-layer nodes;
Step 4: performing the necessary preprocessing of the data according to the input chosen for the end-to-end tone recognition model;
Step 5: sending the selected training speech data into the constructed end-to-end tone recognition model for training and optimization to obtain optimized neural network model parameters, the training speed depending on the machine configuration and the scale of the training data;
Step 6: continuously adjusting the neural network parameters of the tone classification model while observing the training results, selecting the optimal network model parameters, and saving the trained parameters of the tone classification model.
Secondly, tone recognition is carried out:
Step 1: obtaining a test utterance and, when the sentence content text is given, obtaining the start and end times of each syllable by the forced alignment method; when the sentence content text is not given, obtaining the start and end times of each syllable by an automatic speech recognition method;
Step 2: performing the necessary preprocessing of the data according to the input chosen for the end-to-end tone recognition model;
Step 3: sending the selected test speech data into the end-to-end tone recognition model with the trained parameters for recognition, finally obtaining the tone type of each syllable in each test utterance.
Example 3
The schemes in examples 1 and 2 are further described below with reference to specific examples and calculation formulas, which are described in detail below:
(1) training speech recognition acoustic models
Before the tone modeling task, the speech data are used to train a speech recognition acoustic model based on the Gaussian mixture model-hidden Markov model (GMM-HMM) or the deep neural network-hidden Markov model (DNN-HMM). On the training data, where the pronunciation transcription of each sentence is given, the start and end times of each syllable of the speech input are obtained with the acoustic model by the forced alignment method of speech recognition technology. On the test data, when the transcription of a sentence is not given, the syllable sequence corresponding to the speech input and the start and end times of each syllable are obtained by speech recognition decoding. The phoneme boundary information obtained from the aligned segments serves as the boundary basis for tone recognition and classification. The sketch after this paragraph shows one way such alignment output might be consumed.
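As a minimal sketch of this step, the following Python helper parses syllable boundaries from alignment output, assuming the alignments have been exported in the standard CTM format (utterance id, channel, start time, duration, token); the use of syllables as CTM tokens is an assumption of the example, not a requirement of the patent.

```python
from collections import defaultdict

def read_syllable_boundaries(ctm_path: str):
    """Parse syllable start/end times from a CTM alignment file.

    Each CTM line is: <utt> <channel> <start> <dur> <syllable>.
    Returns {utt_id: [(syllable, start_sec, end_sec), ...]}.
    """
    bounds = defaultdict(list)
    with open(ctm_path) as f:
        for line in f:
            utt, _chan, start, dur, syl = line.split()[:5]
            bounds[utt].append((syl, float(start), float(start) + float(dur)))
    return dict(bounds)
```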
(2) Preprocessing network
The input is the raw speech waveform; the fundamental frequency feature extraction network can take raw waveform samples, in which case each input uses 1024 speech samples. The input may also be the coefficients of the normalized cross-correlation function (NCCF).
The calculation is as follows: in the f-th frame, a window sequence w_f is cut from the speech sequence and window-normalized; then from w_f a subsequence v_{f,l} of length n is taken, where l is a time-lag index giving the offset of v_{f,l} within w_f. Different normalized cross-correlation coefficients are computed for different time-lag indices l.
The normalized cross-correlation function coefficients are computed with the following formula:
\mathrm{NCCF}(f, l) = \frac{v_{f,0}^{\top} v_{f,l}}{\sqrt{\lVert v_{f,0} \rVert^{2} \, \lVert v_{f,l} \rVert^{2} + A}} \qquad (1)
where v_{f,0} is v_{f,l} with time-lag index l equal to 0, and A is a penalty term set by hand from experience.
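A minimal NumPy sketch of formula (1) follows; the placement of the penalty term A under the square root matches the formula above, and the default value of A is an assumption of the example.

```python
import numpy as np

def nccf(window: np.ndarray, n: int, max_lag: int, A: float = 1e4) -> np.ndarray:
    """NCCF coefficients of one normalized frame w_f, per formula (1).
    window must hold at least n + max_lag samples."""
    v0 = window[:n]                                   # v_{f,0}: subsequence at lag 0
    e0 = float(np.dot(v0, v0))
    coeffs = np.empty(max_lag + 1)
    for l in range(max_lag + 1):
        vl = window[l:l + n]                          # v_{f,l}: subsequence at lag l
        coeffs[l] = np.dot(v0, vl) / np.sqrt(e0 * np.dot(vl, vl) + A)
    return coeffs
```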
(3) Training fundamental frequency feature extraction network
Fundamental frequencies are first obtained by manual annotation or a traditional fundamental frequency extraction algorithm, and the N fundamental frequency states corresponding to the fundamental frequency values are used as the training labels of the fundamental frequency feature extraction network. Given the inputs of the fundamental frequency network from frame 1 to frame F, the encoder recurrent neural network (RNN) of the fundamental frequency network runs forward and computes the hidden-layer outputs (h_1, h_2, \ldots, h_F):
h_f = \sigma(W x_f + V h_{f-1} + b) \qquad (2)
where \sigma is the sigmoid function; W is the transformation matrix applied to x_f; x_f is the input of the recurrent network at the current frame f; V is the transformation matrix applied to h_{f-1}; h_{f-1} is the hidden-layer output of frame f-1; and b is a bias vector.
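The encoder recurrence of equation (2) is simple enough to sketch directly; the zero initial state h_0 = 0 is an assumption of the example.

```python
import numpy as np

def rnn_encode(X: np.ndarray, W: np.ndarray, V: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Forward pass of the F0 encoder RNN, equation (2):
    h_f = sigma(W x_f + V h_{f-1} + b) for f = 1..F.  X has shape (F, d_in)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    h = np.zeros(b.shape[0])          # h_0 = 0, an assumed initial state
    H = []
    for x_f in X:
        h = sigmoid(W @ x_f + V @ h + b)
        H.append(h)
    return np.stack(H)                # rows are (h_1, ..., h_F)
```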
In the decoding stage, one module of the fundamental frequency extraction network is called the fundamental frequency embedding pool. The embedding pool maps a predicted fundamental frequency label \hat{y}_{f+1} to a fundamental frequency representation vector e(\hat{y}_{f+1}). The fundamental frequency label \hat{y}_{f+1} predicted by the network for time f+1 is converted by the embedding pool into the corresponding fundamental frequency embedding e(\hat{y}_{f+1}), which is spliced with the hidden-layer output h_f of the recurrent network at the current time and passed through a softmax layer to obtain the fundamental frequency output label of frame f:
\hat{y}_f = \arg\max \operatorname{softmax}\big( Z([h_f ; e(\hat{y}_{f+1})]) \big) \qquad (3)
where Z(\cdot) denotes an affine transformation. At frame F, computing the fundamental frequency embedding would require, by formula (3), the fundamental frequency label at time F+1, which exceeds the range of the sentence's fundamental frequency labels. To handle this, frame F+1 is assigned an EOS (end-of-sentence) label, numbered one beyond the largest fundamental frequency label and occupying the last entry of the fundamental frequency embedding pool. After the label of frame F has been predicted in this way, its embedding representation is looked up in the fundamental frequency embedding pool, spliced with the forward-network hidden-layer output h_{F-1} at time F-1, and sent through the softmax layer of formula (3) to obtain the fundamental frequency label at time F-1; iterating in turn, the decoding traces back through times F-2, F-3, ..., until time 1.
When training the fundamental frequency extraction network, teacher forcing is used to improve the convergence speed and the training effect. Teacher forcing means that, during training, the predicted fundamental frequency label \hat{y}_{f+1} of frame f+1 in formula (3) is replaced by the ground-truth label y_{f+1}: y_{f+1} is converted into its embedding e(y_{f+1}) and spliced with the hidden state h_f of frame f to predict the fundamental frequency label at time f:
\hat{y}_f = \arg\max \operatorname{softmax}\big( Z([h_f ; e(y_{f+1})]) \big) \qquad (4)
In the implementation, direct training and teacher forcing are chosen at random, with the mixing coefficient set by experience; in this example it is set to 0.5. In the training stage, given the waveform input of the whole utterance, the network outputs the posterior probability of the fundamental frequency label at each time f:
P(y_f \mid x_1, \ldots, x_F) \qquad (5)
With cross-entropy as the objective function, the network parameters are optimized against the per-frame fundamental frequency labels using the back-propagation algorithm and stochastic gradient descent. A sketch of this objective appears after this paragraph.
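The following sketch computes the per-utterance cross-entropy objective with the 0.5 scheduled teacher-forcing mix described above; in practice the gradients would come from an autodiff framework, so this forward-only NumPy version is purely illustrative.

```python
import numpy as np

def teacher_forced_loss(H, y_true, embed_pool, Z_w, Z_b, eos_label,
                        tf_ratio=0.5, rng=None):
    """Average per-frame cross-entropy with scheduled teacher forcing:
    the label fed back from frame f+1 is the ground truth y_{f+1}
    (formula (4)) with probability tf_ratio, else the model's own
    prediction (formula (3))."""
    rng = rng or np.random.default_rng(0)
    F = H.shape[0]
    prev, loss = eos_label, 0.0
    for f in range(F - 1, -1, -1):                # decode backwards, as in (3)
        joint = np.concatenate([H[f], embed_pool[prev]])
        logits = Z_w @ joint + Z_b
        p = np.exp(logits - logits.max())
        p /= p.sum()
        loss -= np.log(p[y_true[f]] + 1e-12)      # cross-entropy for frame f
        pred = int(np.argmax(p))
        prev = y_true[f] if rng.random() < tf_ratio else pred
    return loss / F
```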
When fundamental frequency features are extracted with the trained fundamental frequency neural network, the network parameters are held fixed; the raw audio waveform or its normalized cross-correlation coefficients are input to predict the fundamental frequency label of each time frame, and the actual fundamental frequency estimate is obtained from the correspondence between labels and fundamental frequency values.
(4) Tone decoding and identifying network
The tone decoding and identification network is shown in FIG. 4; it is a tone-label-dependent forward decoding network comprising two parts: a syllable tone embedding representation network and a context-dependent tone decoding network. The syllable tone embedding representation network converts the variable-length fundamental frequency sequence of each syllable, as output by the fundamental frequency extraction network, into a vector of fixed dimension according to the boundary information of the syllable. Concretely: the output of the fundamental frequency extraction network is extracted according to each syllable's boundary information, spliced over the 9 frames before and after each frame, and sent into the syllable tone embedding representation network to obtain the fixed-dimension embedding representation of each syllable.
The context-dependent tone decoding network predicts the tone label of each syllable in the speech segment from the fundamental frequency values predicted either by traditional fundamental frequency extraction or by the fundamental frequency extraction neural network. The tone embedding representation pool contains vector representations for 6 tone labels: 5 embedding vectors correspond to the 5 Mandarin tones, and the remaining one represents the beginning of a sentence. The tone embedding pool is likewise optimized by the back-propagation algorithm.
When predicting the tone type of the first (1st) syllable, the sentence-begin tone embedding vector e^t_{BOS} is spliced with the embedding representation s_1 of the current syllable and sent into the tone classification network for classification, giving the tone type \hat{t}_1 of the 1st syllable. The tone embedding vector e^t(\hat{t}_1) corresponding to the 1st tone type is then spliced with the embedding representation s_2 of the 2nd syllable and sent into the tone prediction network to obtain the tone label \hat{t}_2 of the 2nd syllable; the 3rd is predicted in turn, and so on until the last syllable. A sketch of this left-to-right loop is given after this paragraph.
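A minimal sketch of the label-dependent decoding loop, with a 6-entry tone embedding pool (5 Mandarin tones plus a sentence-begin entry) and a single affine classification layer; the single-layer classifier and the index chosen for the sentence-begin entry are assumptions of the example.

```python
import numpy as np

def predict_tones(syllable_vecs, tone_embed_pool, clf_w, clf_b, bos_index=5):
    """Left-to-right label-dependent tone decoding: splice the previous tone's
    embedding (the sentence-begin entry for the 1st syllable) with the current
    syllable's fixed-dimension representation, then classify.
    tone_embed_pool: (6, d_e) array; rows 0..4 = the 5 Mandarin tones,
    row 5 = sentence-begin."""
    prev, tones = bos_index, []
    for s in syllable_vecs:                       # one fixed-dim vector per syllable
        joint = np.concatenate([tone_embed_pool[prev], s])
        prev = int(np.argmax(clf_w @ joint + clf_b))
        tones.append(prev)                        # predicted tone in {0..4}
    return tones
```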
(5) End-to-end tone recognition neural network model
The end-to-end tone recognition neural network model is formed by connecting the fundamental frequency extraction network and the tone classification network together, with the network parameters optimized simultaneously. The fundamental frequency labels of the 9 frames before and after the current frame, as output by the fundamental frequency extraction network, are converted into fundamental frequency embeddings by lookup in the N-entry fundamental frequency embedding pool and used as the input of the tone model; alternatively, as a back-end method, the weighted fundamental frequency values can be spliced over the 9 frames. Joint tuning of all parameters yields tone recognition results better than tuning the parts separately. A sketch of the context splicing follows.
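The frame-context splicing that joins the two networks might look as follows; clamping the indices at the utterance edges is an assumption of the example, since the patent does not specify the padding policy.

```python
import numpy as np

def splice_context(f0_embeddings: np.ndarray, center: int, k: int = 9) -> np.ndarray:
    """Tone-network input for one frame: the F0 embeddings of the k frames
    before and after the current frame, concatenated.  Indices are clamped
    at the utterance edges."""
    F = f0_embeddings.shape[0]
    idx = np.clip(np.arange(center - k, center + k + 1), 0, F - 1)
    return f0_embeddings[idx].reshape(-1)         # length (2k + 1) * d_e
```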
In conclusion, the method improves the accuracy of Mandarin tone recognition, reduces computation time, breaks the traditional framework that divides tone recognition into two stages (fundamental frequency feature extraction and tone classification), and constructs an end-to-end tone model. The end-to-end model jointly optimizes the feature extraction stage and the classification stage as one whole network, so the accuracy of Mandarin tone recognition is improved; the network model is robust and the method is suitable for tone research on tonal languages.
In the embodiments of the present invention, except where specifically described, the models of the devices are not limited, as long as the devices can perform the functions described above.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (3)

1. An end-to-end tone recognition method based on a neural network, the method comprising:
firstly, training a tone recognition system model:
constructing an end-to-end tone recognition model, and determining the required hyper-parameters, such as the number of neural network layers and the number of hidden-layer nodes;
training a voice recognition acoustic model on a training set, and acquiring the starting time and the ending time of each syllable by using forced alignment;
sending the selected training voice data and the tone label of each syllable into an end-to-end tone recognition model for training optimization to obtain optimized neural network model parameters;
continuously adjusting parameters of the neural network model, and selecting optimal parameters of the network model;
secondly, tone recognition:
obtaining a test voice, and obtaining the starting time and the ending time of each syllable by using forced alignment under the condition of a given sentence content text; when not given, obtaining the start and end time of each syllable using automatic speech recognition;
sending the selected test voice data into an end-to-end tone recognition model for recognition, and finally obtaining the tone type of each syllable in each test data;
the method comprises the steps of constructing a trainable deep neural network model, combining a fundamental frequency extraction neural network with a tone decoding neural network to form an end-to-end neural network tone classification model, and training and optimizing network parameters of the two parts at the training stage;
the fundamental frequency extraction neural network is an encoder-decoder based on a recurrent neural network, divided into a fundamental frequency encoder network and a fundamental frequency decoder network;
the fundamental frequency encoder network encodes the speech with a recurrent neural network; the fundamental frequency decoder network predicts a fundamental frequency label starting from the last frame of the speech, converts each predicted label into a trainable fundamental frequency embedding vector, and determines the fundamental frequency label at the current time by taking the fundamental frequency label at the next time and the encoder hidden state at the current time as the joint input of the fundamental frequency decoder, until the fundamental frequency label of the first frame has been predicted;
after the fundamental frequency label of each frame has been predicted, the labels are converted into the fundamental frequency value sequence of the whole speech using the predefined mapping relation between labels and fundamental frequency values.
2. The end-to-end tone recognition method based on the neural network as claimed in claim 1, wherein the tone decoding neural network is divided into two parts: tone representation networks and label-dependent tone classification networks;
the tone representation network maps the predicted fundamental frequency value sequence into a vector with fixed dimensionality according to each syllable;
the tone classification network predicts the tone type of the current syllable from the tone label predicted for the previous syllable and the fixed-dimension vector of the current syllable.
3. The end-to-end tone recognition method based on a neural network according to claim 2, wherein the tone classification network predicting the tone type of the current syllable from the tone label predicted for the previous syllable and the fixed-dimension vector of the current syllable specifically comprises:
firstly, according to the fixed dimension representation of the 1 st syllable, splicing with the fixed dimension corresponding to the beginning of a sentence, and sending the fixed dimension representation and the fixed dimension representation into a tone classification network to predict the tone type of the 1 st syllable;
converting the predicted tone type of the 1 st syllable into a corresponding tone label and then sending the corresponding tone label and the fixed dimension representation of the 2 nd syllable as combined input into a tone classification network to obtain the tone type of the 2 nd syllable;
this is repeated until the tone of the last syllable is predicted.
CN201911310349.4A, priority 2019-12-18, filed 2019-12-18: End-to-end tone recognition method based on neural network. Status: Active. Granted publication: CN111063335B (en).

Priority Applications (1)

Application Number: CN201911310349.4A (granted as CN111063335B (en)); Priority Date: 2019-12-18; Filing Date: 2019-12-18; Title: End-to-end tone recognition method based on neural network

Applications Claiming Priority (1)

Application Number: CN201911310349.4A; Priority Date: 2019-12-18; Filing Date: 2019-12-18; Title: End-to-end tone recognition method based on neural network

Publications (2)

Publication Number Publication Date
CN111063335A CN111063335A (en) 2020-04-24
CN111063335B (granted) 2022-08-09

Family

ID=70302281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911310349.4A Active CN111063335B (en) 2019-12-18 2019-12-18 End-to-end tone recognition method based on neural network

Country Status (1)

Country Link
CN (1) CN111063335B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539268A (en) * 2021-01-29 2021-10-22 南京迪港科技有限责任公司 End-to-end voice-to-text rare word optimization method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102938252A (en) * 2012-11-23 2013-02-20 中国科学院自动化研究所 System and method for recognizing Chinese tone based on rhythm and phonetics features
CN103489446A (en) * 2013-10-10 2014-01-01 福州大学 Twitter identification method based on self-adaption energy detection under complex environment
CN107492373A (en) * 2017-10-11 2017-12-19 河南理工大学 The Tone recognition method of feature based fusion

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8756061B2 (en) * 2011-04-01 2014-06-17 Sony Computer Entertainment Inc. Speech syllable/vowel/phone boundary detection using auditory attention cues
US20190130896A1 (en) * 2017-10-26 2019-05-02 Salesforce.Com, Inc. Regularization Techniques for End-To-End Speech Recognition

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102938252A (en) * 2012-11-23 2013-02-20 中国科学院自动化研究所 System and method for recognizing Chinese tone based on rhythm and phonetics features
CN103489446A (en) * 2013-10-10 2014-01-01 福州大学 Twitter identification method based on self-adaption energy detection under complex environment
CN107492373A (en) * 2017-10-11 2017-12-19 河南理工大学 The Tone recognition method of feature based fusion

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A convolutional representation for pitch estimation; Kim J; Proc. of ICASSP; 2018-12-31; full text *
Mandarin tone modeling using recurrent neural networks; Hao Huang; arXiv; 2017-11-06; full text *
Chinese Mandarin tone recognition based on BP networks; Li Shiqiang; Journal of Nanjing University of Information Science & Technology (Natural Science Edition); 2012-05-12; pp. 456-459 *
Research on the application of neural networks in speech tone recognition; Zhang Zhenguo; Microelectronics & Computer; 2005-03-12 (No. 3); pp. 43-49 *

Also Published As

Publication number Publication date
CN111063335A (en) 2020-04-24

Similar Documents

Publication Publication Date Title
CN111739508B (en) End-to-end speech synthesis method and system based on DNN-HMM bimodal alignment network
CN111883110B (en) Acoustic model training method, system, equipment and medium for speech recognition
CN111640418B (en) Prosodic phrase identification method and device and electronic equipment
CN112002308A (en) Voice recognition method and device
Zhu et al. Phone-to-audio alignment without text: A semi-supervised approach
CN112802448A (en) Speech synthesis method and system for generating new tone
CN109377981B (en) Phoneme alignment method and device
CN112349289B (en) Voice recognition method, device, equipment and storage medium
CN112466316A (en) Zero-sample voice conversion system based on generation countermeasure network
CN111883176B (en) End-to-end intelligent voice reading evaluation method
CN112735404A (en) Ironic detection method, system, terminal device and storage medium
CN116229932A (en) Voice cloning method and system based on cross-domain consistency loss
CN111915940A (en) Method, system, terminal and storage medium for evaluating and teaching spoken language pronunciation
CN111063335B (en) End-to-end tone recognition method based on neural network
CN112967720B (en) End-to-end voice-to-text model optimization method under small amount of accent data
CN113539268A (en) End-to-end voice-to-text rare word optimization method
CN117079637A (en) Mongolian emotion voice synthesis method based on condition generation countermeasure network
CN113257221B (en) Voice model training method based on front-end design and voice synthesis method
CN116306592A (en) Senile dementia scale error correction method, system and medium based on reading understanding
Li et al. Improving mandarin tone mispronunciation detection for non-native learners with soft-target tone labels and blstm-based deep models
CN115169363A (en) Knowledge-fused incremental coding dialogue emotion recognition method
CN114446324A (en) Multi-mode emotion recognition method based on acoustic and text features
CN116386637B (en) Radar flight command voice instruction generation method and system
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
CN112530414B (en) Iterative large-scale pronunciation dictionary construction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant